1. PURPOSE

1.1 SCOPE


This report discusses the work performed for the U. S. Army Electronics
Laboratory (USAEL) under Contract No. DA-36-039-SC-90787 during the period
from 1 July 1962 to 30 June 1964.


1.2 OBJECTIVES


The objective of this project has been to investigate the techniques
and concepts of information retrieval and to formulate and develop a
general theory of information retrieval. The formalization of this theory
is oriented to the automation of large-capacity information storage and
retrieval systems. This theoretical framework is intended to serve as a
basis for the use of general purpose stored-program digital computer
systems to perform the storage and retrieval functions.


1.3 PROJECT TASKS


The primary task of this project has been the development of a research
framework based on a general system model in which two processes take place
simultaneously and independently: the insertion of documents into the
system, and the response to queries. A description is attached to each
document as part of the insertion process; most commonly, the description
takes the form of a list of descriptors. The descriptions are stored in
a file, together with indices that permit back-referencing to the documents
themselves. The file is referenced during the processing of a query. Given
this model, the analysis can be broken down into four questions:

(a) How is the descriptive structure of the retrieval system generated?

(b) How are descriptions assigned to documents?

(c) How is the file to be structured?

(d) How is a query processed in order to determine a response?

Each of these questions has generated a project task. The over-all
framework is presented in Section 4.1, and the four questions are discussed
in Sections 4.2, 4.3, 4.4, and 4.5, respectively.


2.  ABSTRACT 

The purpose of this report is to present the results of a research
project on information retrieval. A general system model is presented,
and  this  model  is  used  to  express  the  problem  of  system  specification  in 
terms  of  four  questions: 

(a)  How  is  the  descriptive  structure  of  the  retrieval  system  generated? 

(b)  How  are  descriptions  assigned  to  documents? 

(c)  How  is  the  file  organized? 

(d)  How  is  a  query  processed  in  order  to  determine  a  response? 

The treatment of these questions constitutes the major subdivisions of the
report. Under question (a), the economical assignment of descriptors is
discussed and some measures of accessibility are presented; the nature
of relatedness of descriptors is also examined. Under question (b), the
principal topic is the development of a method for clue word selection in
automatic classification methods based on word occurrence; the question
of automatic abstracting is also treated under this topic. Under question
(c), the relative efficiency of different types of file organizations is
examined quantitatively, and the Multi-List system is described and analyzed.
Under question (d), the topics treated include the development of a method
of probabilistic retrieval and a more searching consideration of the
problems involved in retrieval systems with a high degree of man-machine
interaction.


3. PUBLICATIONS, REPORTS, AND CONFERENCES

3.1  PUBLICATIONS 

A paper by Alfred Trachtenberg entitled "Automatic Document
Classification Using Information Theoretical Methods" was presented at
the 26th Annual Meeting of the American Documentation Institute and
published in the proceedings of that meeting.

3.2 REPORTS

The  following  reports  were  issued  during  the  period  of  this  contract: 

3.2.1  Monthly  Letter  Reports 

(a) MONTHLY LETTER REPORT NO. 1, 1 July 1962 - 31 July 1962, File
No. P-AA-TR-(0006), 3 August 1962; Research in Information
Retrieval. Alfred Trachtenberg.

(b) MONTHLY LETTER REPORT NO. 2, 1 August 1962 - 31 August 1962,
File No. P-AA-TR-(0009), 31 August 1962; Research in Information
Retrieval. Alfred Trachtenberg.

(c) MONTHLY LETTER REPORT NO. 3, 1 October 1962 - 31 October 1962,
File No. P-AA-TR-(0012), 31 October 1962; Research in Information
Retrieval. Alfred Trachtenberg.

(d) MONTHLY LETTER REPORT NO. 4, 1 November 1962 - 30 November 1962,
File No. P-AA-TR-(002fj), 30 November 1962; Research in Information
Retrieval. Alfred Trachtenberg.

(e) MONTHLY LETTER REPORT NO. 5, 1 January 1963 - 31 January 1963,
File No. P-AA-TR-(0032), 31 January 1963; Research in Information
Retrieval. Alfred Trachtenberg.

(f) MONTHLY LETTER REPORT NO. 6, 1 February 1963 - 28 February 1963,
File No. P-AA-TR-(0033), 28 February 1963; Research in Information
Retrieval. Alfred Trachtenberg.

(g) MONTHLY LETTER REPORT NO. 7, 1 April 1963 - 30 April 1963, File
No. P-AA-TR-(0046), 30 April 1963; Research in Information
Retrieval. George Greenberg.

(h) MONTHLY LETTER REPORT NO. 8, 1 May 1963 - 31 May 1963, File No.
P-AA-TR-(0048), 31 May 1963; Research in Information Retrieval.
George Greenberg.

(i) MONTHLY LETTER REPORT NO. 9, 1 July 1963 - 31 July 1963, File
No. 3201-TR-0039, 31 July 1963; Research in Information
Retrieval. George Greenberg.

(j) MONTHLY LETTER REPORT NO. 10, 1 August 1963 - 31 August 1963,
File No. 3201-TR-0063, 31 August 1963; Research in Information
Retrieval. George Greenberg.

(k) MONTHLY LETTER REPORT NO. 11, 1 October 1963 - 31 October 1963,
File No. 3201-TR-0070, 31 October 1963; Research in Information
Retrieval. George Greenberg.

(l) MONTHLY LETTER REPORT NO. 12, 1 November 1963 - 30 November 1963,
File No. 3201-TR-0073, 30 November 1963; Research in Information
Retrieval. Paul W. Abrahams.

(m) MONTHLY LETTER REPORT NO. 13, 1 January 1964 - 31 January 1964,
File No. 3201-TR-0079, 31 January 1964; Research in Information
Retrieval. Paul W. Abrahams.

(n) MONTHLY LETTER REPORT NO. 14, 1 February 1964 - 29 February 1964,
File No. 3201-TR-0081, 2 March 1964; Research in Information
Retrieval. Paul W. Abrahams.


3.2.2 Quarterly Progress Reports


(a) RESEARCH IN INFORMATION RETRIEVAL: First Quarterly Report.
1 July 1962 - 30 September 1962, Technical Report P-AA-TR-(0010),
30 October 1962.

(b) RESEARCH IN INFORMATION RETRIEVAL: Second Quarterly Report.
1 October 1962 - 31 December 1962, Technical Report P-AA-TR-(0031),
31 January 1963.

(c) RESEARCH IN INFORMATION RETRIEVAL: Third Quarterly Report.
1 January 1963 - 31 March 1963, Technical Report P-AA-TR-(0044),
30 April 1963.

(d) RESEARCH IN INFORMATION RETRIEVAL: Fourth Quarterly Report.
1 April 1963 - 30 June 1963, Technical Report 5201-TR-0038,
31 July 1963.

(e) RESEARCH IN INFORMATION RETRIEVAL: Fifth Quarterly Report.
1 July 1963 - 30 September 1963, Technical Report 3201-TR-0069,
31 October 1963.

(f) RESEARCH IN INFORMATION RETRIEVAL: Sixth Quarterly Report.
1 October 1963 - 31 December 1963, Technical Report 5201-TR-0078,
31 January 1964.

(g) RESEARCH IN INFORMATION RETRIEVAL: Seventh Quarterly Report.
1 January 1964 - 31 March 1964, Technical Report 3201-TR-0088,
30 April 1964.

3.3  CONFERENCES 


3.3.1  Conferences  with  USAEL  Personnel 

The following conferences were held between DISD personnel and USAEL
personnel:

(a) $ July 1962--Meeting at DISD. Discussions of objectives and
plans for the research activity were initiated. The formulation
of a method of approach was requested for presentation at
the next meeting.

(b) 17 July 1962--Meeting at DISD. A technical note prepared by the
project staff was used as the basis of discussions pertaining to
the scope, development phases, alternative plans, and recommended
direction for the project.

(c) 18 July 1962--Meeting at DISD. Informal discussion of Signal
Corps objectives and goals for research activity.

(d) 9 August 1962--Meeting at Fort Monmouth, New Jersey. Discussions
were held concerning the functional characteristics of information
retrieval systems. No particular area of activity was
selected for further study.

(e) 10 September 1962--Meeting at DISD. Several methods of relating
descriptor systems in a generalized sense were discussed in relation
to the requirements for a file structure. The analysis and
development of a general theory was recommended as the objective
of the project.

(f) 29 November 1962--Meeting at DISD. DISD personnel were introduced
to Mr. Anthony V. Campi, who had recently been assigned
as Project Engineer. Several aspects of the First Quarterly
Report were discussed, and the concepts pertaining to measure
of relevance were clarified. DISD accepted the suggestion that
the discussion in the report should be elaborated in more detail.

(g) 28 February 1963--Meeting at DISD. DISD personnel met with
Mr. Anthony V. Campi, who had recently been assigned as Project
Engineer. Several aspects of the Second Quarterly Report were
discussed. A few minor corrections and elaborations were
requested, and a general emphasis on the importance of user
requirements was indicated.

(h) 25 April 1963--Meeting at DISD. Mr. David Haretz and
Mr. Larry Sarlo conferred with project personnel on the general
impact and significance of the report on scientific information
prepared by the President's Science Advisory Committee. This
report is entitled Science, Government, and Information.

(i) 6 June 1963--Meeting at DISD. Lt. Fred Hill and Mr. Larry Sarlo
conferred with project personnel about the manuscript version
of the Third Quarterly Report. Difficult concepts were explained,
and questions were discussed. Several suggested changes were
accepted for inclusion in the published form. The plans for
the current quarter and for future activity were also discussed.

(j) 2 August 1963--Meeting at DISD. Mr. Larry Sarlo and Lt. Fred Hill
were briefed on progress made during the fourth quarter of the
information retrieval project. Researchers presented aspects of
their work during the quarter which were included in the Fourth
Quarterly Report. Plans for the fifth quarter and future activity
were also discussed.

(k) 21 October 1963--Meeting at DISD. Mr. David Haretz and
Mr. Larry Sarlo conferred with project personnel to review the
first draft of the Fourth Quarterly Report. The report was
reviewed in detail, and some concepts relating to the Multi-List
system were clarified.

(l) 20 November 1963--Meeting at USAEL. Project personnel conferred
with Mr. David Haretz, Mr. Larry Sarlo, and Mr. Serafino Amoroso.
Several problems were discussed and settled, and some of the
difficulties of system integration were examined.

(m) 4 March 1964--Meeting at DISD. Mr. Larry Sarlo of USAEL reviewed
the first draft of the Sixth Quarterly Report. Several minor
corrections were made, and some technical difficulties were
clarified.

(n) 25 June 1964--Meeting at DISD. A discussion was held between
project personnel and Mr. David Haretz, Mr. Anthony V. Campi,
and Mr. David Hadden, Jr., of USAEL. The current status and
accomplishments of the project were discussed, and the content
of the final report was considered.

3.3.2 Other Conferences

During the term of this project, various project personnel attended
conferences relating to information retrieval. Attendance at these
conferences was sponsored by DISD; the knowledge gained was of considerable
help in pursuing specific research areas within this project.

(a) 3 December 1962 - 7 December 1962--Mathematics of Information
Storage and Retrieval. Quentin A. Darmstadt attended this
conference, which was conducted by Dr. Robert M. Hayes under the
auspices of the Georgia Institute of Technology.

During  this  period  several  ancillary  conferences  were  also  attended: 

(b) 2 May 1963--NASA Scientific and Technical Information Conference.
This conference was held in Atlanta, Georgia, and was attended
by George Greenberg. The conference presented NASA's methods
and techniques for acquiring, processing, storing, disseminating,
and retrieving information.

(c) 17 June 1963--Simulation of Cognitive Processes. This seminar
was conducted for six weeks at the RAND Corporation. Its purpose
was to discuss the problems of information systems. George
Greenberg was an invited participant. During the time spent at
the seminar Dr. Greenberg had the opportunity to discuss the
problems of information retrieval with several other research
organizations.

(d) 6 October 1963 - 11 October 1963--26th Annual Meeting of the
American Documentation Institute. This meeting was attended by
Jacques Harlow and Alfred Trachtenberg; Mr. Trachtenberg presented
some of the results of the project in an invited paper
at the conference.

(e) 19 February 1964--Meeting with Dr. Harold Borko. Paul Abrahams
met with Dr. Borko at the System Development Corporation in
Santa Monica, California. The research carried on under this
contract was discussed, and Dr. Borko offered a number of
helpful suggestions.
4. FACTUAL DATA


4.1 STATEMENT OF THE PROBLEM

4.1.1 Original Formulation - The technical requirement of the Signal
Corps, as specified in SCL-lj.3J>J>, is for "...a research investigation of
techniques and concepts necessary for the efficient mechanization of
large-capacity information storage and retrieval systems." Among the
applied objectives suggested as guides for such research are "...problems
of military significance; i.e., personnel files, intelligence data, etc."

4.1.2 System Model and Definitions - The purpose of an information
storage and retrieval system is to record a body of information and to
provide to a group of users a means of answering questions pertaining to
this information. The information is ordinarily provided in the form of
a discrete set of documents, such as books, parts listings, personnel
records, or newspaper articles. Information retrieval systems may be
either document retrieval systems or content retrieval systems; a
document retrieval system responds to a query with a set of documents that
are relevant to the user's question, while a content retrieval system
provides the actual answer to the question. Document retrieval systems
may further be subdivided into those systems that provide the actual
documents and those that merely tell where the documents are located.

Most of the research described in this report has been concerned with
document retrieval systems that provide the locations of documents rather
than the documents themselves. In order to clarify the terminology, it
will be helpful to present a generalized model of how such systems operate.
A diagram of this model is shown in Figure 1.

FIGURE 1. Basic System Model

There are two major processes taking place in the system: the incorporation
of documents into a file and the response to queries. These processes take
place asynchronously. A query as we use it here is not quite the same thing
as a question; a question is the user's own description of the information
he needs, while a query is in a form that the system can operate
upon and respond to. Questions may be vague and formless; queries must
be specific and formal.

Associated with each document stored in the system are an index and
a description. The index specifies either directly or indirectly where
the document is physically stored. (For instance, the personnel file of
an employee can be located physically if the employee's name is known,
or even if only the serial number of a card giving his name is known.)
The description relates to the content of the document, and consists of
that information about the document that is available for matching against
queries. The file contains the index and description of each available
document. The query processor operates on queries, making use of the
file, to produce the indices of those documents that are responsive to
the query. The file should really be thought of as an integral part of
the query processor.


When a document is entered into the system it is presented to a
classifier that generates the description of the document. The output
of the classifier is then paired with the index of the document and stored
in the file. A variation on this configuration is to have the index
derived from the description; the ordinary library follows this
procedure, since the physical location of a book depends on its description.

These concepts can be clarified by means of a simple example.
Consider a library of technical journals. Since each journal may contain
several unrelated articles, each article is treated as a separate document.
The index of each document is the journal name, volume number, and page
number. The librarian records, for each document, a list of subject
headings that describe the document; this list is the document description.
A separate card is made up for each appropriate subject heading, listing
the subject heading and the document index. The file consists of the
subject cards for all the available documents. If the cards are stored
alphabetically by subject, then a query consists of the name of a single
subject, and the processing of a query consists of locating the set of
cards for that subject. The response is the set of index numbers listed
on the cards. Of course, the system of this example will not be
particularly effective, but it does serve to illustrate the concepts.
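
The journal-library example can be sketched in a few lines of Python. The
sketch is illustrative only: the names Index, build_file, and process_query,
and the sample journal entries, are assumptions introduced here and are not
part of the project's work.

```python
from collections import defaultdict
from typing import NamedTuple


class Index(NamedTuple):
    """Document index: where the article is physically found."""
    journal: str
    volume: int
    page: int


def build_file(documents):
    """Build the file: one 'card' per (subject heading, document index) pair.

    `documents` is an iterable of (index, description) pairs, where the
    description is the list of subject headings chosen by the librarian.
    """
    cards = defaultdict(set)
    for index, description in documents:
        for heading in description:
            cards[heading].add(index)
    return cards


def process_query(cards, subject):
    """A query is a single subject name; the response is a set of indices."""
    return cards.get(subject, set())


# Hypothetical sample data, invented for illustration.
documents = [
    (Index("Journal A", 3, 17), ["file organization", "list processing"]),
    (Index("Journal A", 3, 42), ["automatic abstracting"]),
    (Index("Journal B", 9, 101), ["file organization"]),
]
cards = build_file(documents)
response = process_query(cards, "file organization")
```

As in the text, the response contains only indices (journal, volume, page),
not the articles themselves.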

The most common form of document description consists of a list of
descriptors such as the subject headings of the previous example. The
descriptors may have additional information associated with them, or they
may be related to one another in rather complex ways. It should be
emphasized that the descriptor list is not the only possible form of a
document description.

Various modifications of the model in Figure 1 are possible. One
such modification is to have the index as an output of the classification
process rather than having it be independent of that process. A different
variation would be a system that produced documents rather than indices
in response to queries. In such a system the description of a document
would be the document itself. The document--or significant and definable
parts of a document--would then absorb the function of the index when the
query capabilities were activated.

It is also possible to conceive of query capabilities with the ability
to retrieve only the relevant portions of documents. At the least
sophisticated level this variation simply involves refining the organization
of the total data so that a larger number of functional documents is
available for output. This procedure could be achieved by applying the same
processes to small subunits of conventional documents. Finally, the system
may be able to produce responses to queries that are neither documents nor
portions of documents but responses derived from the information contained
in various documents. Such a system must have the capacity to perform
inferential processing on the contents of the documents. This type of
system would correspond to a content retrieval system.

4.1.3 Problem Formulation and Task Breakdown - A model may describe
the operation of an information retrieval system; but in order to develop
an operating system, questions relating to the requirements of an
information retrieval system must first be answered.

An analysis of the system model given in Figure 1 leads to the
breakdown of the problem into four tasks. The form of the descriptions that
are transmitted from the classifier to the file must be defined; in
addition, the three major system components--the classifier, the file, and
the query processor--must be specified. In order to account for the
static as well as the dynamic aspects of information retrieval, these
requirements may be expressed in terms of four questions:

(a)  How  is  the  descriptive  structure  of  the  retrieval  system  generated? 

(b)  How  are  descriptions  assigned  to  documents? 

(c)  How  is  the  file  to  be  structured? 

(d)  How  is  a  query  processed  in  order  to  determine  a  response? 

Although the answers to each of these questions are interdependent, it is
still possible to consider each of them separately. Question (a) must be
answered first since the very definitions of the other questions depend
upon it. For instance, it is impossible to talk about the assignment of
descriptions to documents until the class of possible descriptions has
been settled upon. Since a good file organization will be based upon the
descriptive structure in use, question (a) must be answered before
question (c) can be considered. Question (d), in turn, depends upon question
(c) since the file is an integral part of the query processor.

In considering question (a) it must be recognized that the descriptive
structure of a retrieval system will depend upon the particular corpus of
information that it is to operate upon. It is the method of generating
descriptions rather than the descriptions themselves that is invariant
from one corpus to another. Furthermore, the class of possible
descriptions may itself vary with time as new types of documents are
introduced into the system and rarely used ones dropped out.

Questions (a) and (c) may be regarded as concerned with the static
aspects of a retrieval system, while questions (b) and (d) deal with the
dynamic aspects. The descriptive structure and file structure are usually
fixed before the system becomes operational and are modified, at worst,
at a slow rate thereafter. The assignment of descriptions to documents
and the answering of queries, on the other hand, are on-going processes.

In order to clarify these questions, each of them will be discussed
in greater detail in the following section.

4.1.4  Explication  of  the  System  Requirements 

4.1.4.1 Descriptive Structures - Most descriptive structures
are based on the use of descriptors. Descriptors are introduced into
information retrieval systems in order to reduce the language recognition
and transformation requirements and to reduce the complexity of the
data structures or content relationships. In short, descriptors
represent an artificially restricted standard language used to increase the
convenience of handling requests, constructing and organizing files,
and searching for answers.

One of the major problems in constructing a descriptor system
is the proper selection of the descriptors that are class names for
synonyms so as to maximize retrieval of relevant information and to
minimize noise, the retrieval of irrelevant data. The descriptors must
be words in common use, as unambiguous as possible, and sufficiently
numerous to delineate relatively fine distinctions. Obviously, the
more documents filed under a given descriptor, the larger the noise is
likely to be.

To increase the number of relevant documents retrieved in
response to a given request, descriptors for the request can be weighted.
These weights can be assigned according to the relevance and the
importance of the particular descriptor under consideration. The system can
then produce responses ordered according to weights assigned descriptors
or responses greater than a fixed weight of relevance and importance.
Another scheme for reducing irrelevance in responses is to assign
descriptors to each section of documents added to the file. This method, of
course, increases the degree of content retrieval.

Increasing the flexibility of descriptors by introducing role
indicators or specifying terms as actions, relations, results, means,
purpose, or locations is a further step toward content retrieval in the
sense that it is the beginning of syntactical and semantic specification
of request terms.
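
The weighting of request descriptors described above, with responses either
ordered by weight or cut off at a fixed weight, might be sketched as follows.
The additive scoring rule, the function names, and the sample file are
assumptions made for illustration; the text does not prescribe a particular
formula.

```python
def score_documents(file, weighted_request):
    """Score each document by summing the weights of the request
    descriptors that appear in its description (an assumed rule)."""
    scores = {}
    for doc_index, descriptors in file.items():
        score = sum(w for d, w in weighted_request.items() if d in descriptors)
        if score > 0:
            scores[doc_index] = score
    return scores


def ranked_response(file, weighted_request, threshold=0):
    """Return document indices ordered by decreasing score, keeping only
    those whose score exceeds `threshold`."""
    scores = score_documents(file, weighted_request)
    kept = [(doc, s) for doc, s in scores.items() if s > threshold]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [doc for doc, _ in kept]


# Hypothetical file: document index -> set of assigned descriptors.
file = {
    "D1": {"radar", "antenna"},
    "D2": {"radar", "maintenance"},
    "D3": {"personnel"},
}
request = {"radar": 2, "antenna": 1}

ordered = ranked_response(file, request)               # every matching document
strong = ranked_response(file, request, threshold=2)   # only scores above 2
```

Raising the threshold trades the number of documents retrieved against the
amount of noise in the response.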

4.1.4.2 Assignment of Descriptions to Documents - If the
selected form of description for documents is the descriptor list, then
the simplest method of classification would be simply to assign to a
document those descriptors that occurred within its title. This rule is the
basis of the quite popular KWIC indexing system. Its defect is that a
descriptor must have associated with it a large number of synonyms, since
the occurrence of the intended descriptor in a title is usually rather
unlikely.
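
A minimal sketch of the title-based rule follows, including the synonym
lists whose size is identified above as the defect of the method. The
descriptor vocabulary and its synonyms are invented for illustration, and
the word splitting is deliberately crude.

```python
import re

# Hypothetical vocabulary: descriptor -> words treated as its synonyms.
SYNONYMS = {
    "retrieval": {"retrieval", "search", "searching", "lookup"},
    "classification": {"classification", "indexing", "categorization"},
}


def descriptors_from_title(title, synonyms=SYNONYMS):
    """Assign every descriptor that has at least one synonym in the title."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return sorted(d for d, syns in synonyms.items() if words & syns)


assigned = descriptors_from_title("Automatic Indexing and Searching of Reports")
```

Neither descriptor appears literally in the sample title; both are assigned
only because a synonym does.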

More elaborate classification schemes can be based upon the
occurrence of words other than the descriptors themselves within either
the title, the abstract, or the text of a document. These methods are
also capable of generalization to account for word frequency as well as
word occurrence information and to assign different weights to words
according to their relevance to the category. Such approaches are
particularly amenable to automatic classification; their defect is that
they cannot be quite so readily adapted to descriptions more complicated
than the simple descriptor list.
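
The generalization to word frequencies and per-category word weights can be
sketched as below. The clue words, the weights, and the linear scoring rule
are assumptions chosen for illustration; they are not the clue word
selection method developed in this project.

```python
import re
from collections import Counter

# Hypothetical clue-word weights for two categories.
CLUE_WEIGHTS = {
    "file organization": {"file": 2, "list": 1, "storage": 1},
    "abstracting": {"abstract": 2, "sentence": 1, "summary": 1},
}


def classify(text, clue_weights=CLUE_WEIGHTS):
    """Return the category with the highest weighted clue-word count,
    or None if no clue word occurs in the text."""
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        category: sum(weight * freq[word] for word, weight in clues.items())
        for category, clues in clue_weights.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


category = classify("The file is kept as a list; each file entry points to storage.")
```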

For more complicated kinds of descriptions, such as descriptors
interrelated through the use of connectives, more sophisticated textual
analyses are necessary. Word occurrences can still be used as aids in
locating key sentences within the document, but for this type of
classification the use of syntactic analyzers is probably unavoidable.

4.1.4.3 Organization and Structure of Files - If information
retrieval is viewed generally, it can be defined as locating and presenting
a specific informative and accurate answer or piece of information
in response to a specific question. Accomplishing this function requires
a classification scheme that groups larger units of related information;
e.g., documents or sections of documents. Descriptors are assigned to
units of information. The file consists of the system of descriptors
and of information units ordered in some fashion to indicate the relations
between descriptors and information. Generally, a descriptor is
associated with many units of information and a unit of information may
be described by several descriptors. In addition, the file structure
must provide for relations among information units and among descriptors.

One of the best known systems that can be used to relate descriptors
is the hierarchical classification or tree structure originally
developed for biological classification. This type of structure forms
a Boolean algebra under the relation of class inclusion. This model is
only appropriate for a limited field of information in which a class is
immediately subordinate to only one other class. This restriction
requires a breakdown into small units of information, which means that
the descriptor file would be composed of a large number of hierarchies
of class inclusion. (The Multi-List system is a device for circumventing
the limitations of ordinary list processing or hierarchies by allowing
for relations among branches.)

For information fields of some diversity, the relations among
descriptors usually form complicated networks to which the tree theory is
not directly applicable. A general model of a complicated descriptor
network is represented by means of a complemented modular lattice. This
model is of sufficient generality to cover a wide variety of situations.
Most elements are multiply connected rather than singly connected as in
a tree. The lattice model is referred to as a weak hierarchy: an element
may have more than one predecessor. The tree is a strong hierarchy: an
element has only one predecessor. The principal problem with the
lattice model is that the number of nodes in the network quickly reaches
into the millions if all relations between descriptors are represented.
Consequently, the problem becomes one of effectively limiting the number
of relations represented among descriptors.
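The distinction between a strong and a weak hierarchy can be illustrated by a minimal sketch. The descriptor names and the `predecessors` mapping are hypothetical and do not come from this report; the check merely counts the immediate predecessors of each element.

```python
# Hypothetical descriptor network: each descriptor maps to the set of its
# immediate predecessors (broader classes). All names are illustrative.
def is_strong_hierarchy(predecessors):
    """Return True if every element has at most one immediate predecessor."""
    return all(len(parents) <= 1 for parents in predecessors.values())

tree = {
    "animals": set(),
    "mammals": {"animals"},
    "birds": {"animals"},
    "bats": {"mammals"},
}

# In the weak hierarchy, "bats" is filed under two broader classes.
lattice = {
    **tree,
    "flying animals": {"animals"},
    "bats": {"mammals", "flying animals"},
}

print(is_strong_hierarchy(tree))     # True
print(is_strong_hierarchy(lattice))  # False
```

A single multiply connected element is enough to take the network out of the tree model.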

The descriptor file associates descriptors with information
units or items of data. These associations can be represented by a
matrix of ones and zeros, where descriptors may be ordered as rows and
information units as columns. A one indicates a relation; a zero, none.
For a rich information store, this matrix will be large and most of its
elements will be zeros. It is, therefore, an uneconomical representation.
The matrix can be compressed by listing rows or columns (descriptors or
data) and related items only for each entry. Of course, access to the
file is much simpler for descriptor entry. Search time for these types
of files can be reduced by using multiple entry of terms or by an ordered
arrangement of both descriptors and data. Generic relations among terms
can be shown by direct cross references, carried with each descriptor,
or by a code of hierarchical class numbers showing the generic structure
of the terms.
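The compression of the matrix described above can be sketched as follows. The descriptors, information units, and matrix entries are invented for illustration; the two functions list only the related items for each row entry or each column entry.

```python
from collections import defaultdict

# Hypothetical data: rows are descriptors, columns are information units.
descriptors = ["radar", "antenna", "budget"]
units = ["doc1", "doc2", "doc3", "doc4"]
matrix = [
    [1, 0, 1, 0],  # radar
    [1, 0, 0, 0],  # antenna
    [0, 0, 0, 1],  # budget
]

def compress_by_descriptor(matrix, descriptors, units):
    """For each descriptor (row), list only the units related to it."""
    return {
        d: [u for u, bit in zip(units, row) if bit]
        for d, row in zip(descriptors, matrix)
    }

def compress_by_unit(matrix, descriptors, units):
    """For each information unit (column), list only its descriptors."""
    by_unit = defaultdict(list)
    for d, row in zip(descriptors, matrix):
        for u, bit in zip(units, row):
            if bit:
                by_unit[u].append(d)
    return dict(by_unit)

descriptor_file = compress_by_descriptor(matrix, descriptors, units)
unit_file = compress_by_unit(matrix, descriptors, units)
print(descriptor_file["radar"])  # ['doc1', 'doc3']
print(unit_file["doc1"])         # ['radar', 'antenna']
```

Entry by descriptor answers a query with one lookup per descriptor, whereas entry by information unit requires a pass over every unit.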

4.1.4.4 Query Response - In a retrieval system based upon
descriptors there are two requirements for effective response to queries.
The first is the transformation of the query into the standard search
terms. The second is the particular strategy or methodology for
searching the descriptor file effectively and fruitfully.

Transforming a query into standard descriptor terms is basically
a form of translation from a rich language into a summary language or the
matching of two sets of terms, one large, the other smaller. In order to
accomplish this transformation, the meaning and relations between terms
of the two sets or languages must be understood. Aid may be provided in
the form of a dictionary or glossary of subject matter. The knowledge
required to transform requests into descriptors is most simply provided
to a computer by furnishing it with a thesaurus. Any more sophisticated
means would involve a considerable capability for linguistic
transformation on the part of the computer.
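A thesaurus-driven transformation of this kind might look like the following sketch. The thesaurus entries, the descriptor names, and the function `to_descriptors` are assumptions made for illustration only; a working thesaurus would be far larger and would also record relations between terms.

```python
# Hypothetical thesaurus: free-language terms -> standard descriptors.
THESAURUS = {
    "aircraft": "AIRCRAFT",
    "airplane": "AIRCRAFT",
    "plane": "AIRCRAFT",
    "radar": "RADAR",
    "radiolocation": "RADAR",
}

def to_descriptors(query):
    """Map the words of a query onto standard descriptors.

    Returns the set of descriptors found and the list of words for which
    the thesaurus has no entry.
    """
    found, unknown = set(), []
    for word in query.lower().split():
        word = word.strip(".,;:?!")
        if word in THESAURUS:
            found.add(THESAURUS[word])
        elif word:
            unknown.append(word)
    return found, unknown

found, unknown = to_descriptors("Tracking a plane by radiolocation?")
print(sorted(found))  # ['AIRCRAFT', 'RADAR']
print(unknown)        # ['tracking', 'a', 'by']
```

The words left in `unknown` mark the point at which a simple table lookup stops and linguistic analysis would have to begin.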

The formulation of a query and its transformation into a limited
set of descriptors often does not provide sufficient information and
direction to obtain exhaustive information concerning a subject that may
exist in the data file. Effective search procedures are closely related
to the way in which the descriptor file is structured and what sort of
relations are indicated there. The most common method of searching is
the conjunctive search, which retrieves only that information related
to or encompassed by all the request descriptors in conjunction. It is
also possible to construct search procedures in terms of logical sums,
differences, complements, and more complicated combinations of these
functions as well as weighted logical functions in terms of set densities.11
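The conjunctive search and the other logical search functions correspond directly to set operations on the descriptor file. The sketch below assumes a file held as a mapping from descriptors to sets of unit numbers; the descriptors and numbers are invented, and weighted functions are not shown.

```python
# Hypothetical descriptor file: descriptor -> set of information units.
FILE = {
    "RADAR": {1, 2, 3, 5},
    "ANTENNA": {2, 3, 4},
    "AIRBORNE": {3, 5, 6},
}
ALL_UNITS = set().union(*FILE.values())

def conjunctive(*terms):
    """Units indexed under every request descriptor (logical product)."""
    sets = [FILE.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def logical_sum(*terms):
    """Units indexed under at least one request descriptor."""
    return set().union(*(FILE.get(t, set()) for t in terms))

def difference(include, exclude):
    """Units indexed under one descriptor but not under another."""
    return FILE.get(include, set()) - FILE.get(exclude, set())

def complement(term):
    """Units in the file that are not indexed under the descriptor."""
    return ALL_UNITS - FILE.get(term, set())

print(conjunctive("RADAR", "ANTENNA"))     # {2, 3}
print(logical_sum("ANTENNA", "AIRBORNE"))  # {2, 3, 4, 5, 6}
print(difference("RADAR", "ANTENNA"))      # {1, 5}
print(complement("RADAR"))                 # {4, 6}
```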

4.1.5 Relation to Specific Problems - In the following three
subsections the four questions posed in Section 4.1.3 will be examined in
relation to three specific information retrieval problems: personnel
files, literature, and intelligence information. These three problems
will be examined in increasing order of difficulty.

4.1.5.1 Personnel Files - In extracting information from
personnel files, the critical questions are (c) and (d), namely, file
structure and response to queries. Each personnel record will normally
be composed of a set of fields, each giving some characteristic of the
individual person. Some of these fields may be variable in length; e.g.,
there may be a field listing the age and sex of all dependent children.
The descriptive structure of such a file is trivial, since the
description of any document (i.e., individual record) is simply that subset of
the fields that may be used for retrieval purposes. The process of
assigning descriptions to documents is nothing more than deletion followed
by straightforward encoding.


The file structure problem in this case concerns the specific
device used to store the information and the arrangement of the items
within this device. For instance, it may be possible physically to
string together those items that possess a common characteristic; this
technique is effectively what is done by the Multi-List system (cf.
Section 4.4.2). On the other hand, items may be placed in a special
order, with appropriate indexing systems. Query processing consists of
nothing more than matching, but the mechanization of this matching may
be quite complex and will certainly be closely related to the file
organization. For personnel files the problem of deciding whether a
particular document is responsive to a particular query is quite trivial.
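The idea of stringing together items that share a characteristic can be sketched with chained records. This is only an illustration of the general technique under assumed field names and record contents; it is not a description of the actual storage layout of the Multi-List system.

```python
# Hypothetical personnel records; field names and values are illustrative.
records = [
    {"name": "Adams", "grade": "E5", "skill": "radio"},
    {"name": "Baker", "grade": "E4", "skill": "radar"},
    {"name": "Clark", "grade": "E5", "skill": "radar"},
]

def build_chains(records, fields):
    """Thread records that share a (field, value) pair into one chain.

    Returns a directory mapping (field, value) to the index of the first
    record on the chain. Each record receives a 'next' dict holding, per
    field, the index of the following record on that chain.
    """
    directory, last = {}, {}
    for i, rec in enumerate(records):
        rec["next"] = {}
        for f in fields:
            key = (f, rec[f])
            if key in last:
                records[last[key]]["next"][f] = i
            else:
                directory[key] = i
            last[key] = i
    return directory

def follow(records, directory, field, value):
    """Walk one chain and yield the matching records in file order."""
    i = directory.get((field, value))
    while i is not None:
        yield records[i]
        i = records[i]["next"].get(field)

directory = build_chains(records, ["grade", "skill"])
print([r["name"] for r in follow(records, directory, "grade", "E5")])
# ['Adams', 'Clark']
```

A query on one characteristic then touches only the records on that chain rather than the whole file.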

4.1.5.2 Literature Retrieval - In a literature retrieval system,
unlike the personnel file, the problem of selecting a descriptive
structure and then of classifying documents is no longer trivial.
Furthermore, the question of whether or not a particular document is
responsive to a particular query cannot be answered with certainty but
only with probability.

The most common form of description for literature retrieval
systems is the descriptor list. In this case the choice of descriptors
becomes critical, since the descriptors are used both for classification
and for querying. The particular descriptors used will depend on the
subject matter of the literature being classified, although the nature
of the interrelation may be subject-independent if the descriptors
within a description are interrelated.

Given a set of descriptors, the problem of classifying documents
is still quite difficult. An approach to this problem that utilized the
occurrence of clue words is discussed in this report. A complicating
factor is the difference between the use of a descriptor in a document
and the meaning of that descriptor as understood by a user of the system.
If these meanings are divergent, then poor system performance may result.

The  problem  of  file  structure  and  query  response  in  a  literature 
retrieval  system  is  similar  to  that  of  a  personnel  file.  Once  descriptors 
have  been  assigned  to  documents,  the  process  of  answering  a  query  is  again 
purely  a  matching  process.  The  guesswork  occurs  not  in  the  response  to 
queries  but  in  the  classification. 


4.1.5.3 Intelligence Information - In retrieving intelligence
information all the difficulties that exist in literature retrieval
are retained, but in addition the problem of query processing is no longer
merely a matter of matching. In its more elaborate forms, in fact,
intelligence information processing really requires the use of implicit
information retrieval techniques. On a lesser level, it may still be necessary
to consider the interrelationships of different items of data in order to
decide which ones are to be provided in the response to a query. Items
that are useless by themselves may become useful as part of a chain of
related events.

Processing of intelligence information will almost certainly
require the use of syntactic and semantic analysis. For information of
this type it is virtually impossible for a system to respond to queries
unless it is capable of extracting the meaning of a sentence or a document.
Terms of interest will ordinarily occur far too frequently within the corpus
of information for mere word occurrence or frequency data to be particularly
helpful in isolating salient data. In addition, much of the required
output will be useful only when presented in appropriate combination.

A further salient aspect of intelligence information processing
is that ordinarily one would expect to ask many queries in order to
answer a question. Thus there exists a feedback relationship between
the system and the user, in which each query is largely determined by
the response to the last one. The structure of the query processor,
and consequently of the query language, must be constructed to account
for this feedback relationship.

4.2  DESCRIPTIVE  STRUCTURE  OF  RETRIEVAL  SYSTEMS 

In an information retrieval system a description is attached to each
document, and this description represents all the information about the
document that is available to the system for retrieval purposes. The
descriptive structure of the system is concerned with the class of
possible descriptions, but not with how descriptions are actually assigned
to documents. The descriptive systems examined within the scope of this
project, with the exception of the material on automatic abstracting, have
assumed that there exists a set of descriptors from which descriptions
are constructed. Three key questions then remain:

(a) How are the descriptors to be selected?

(b) What information is to be attached to a descriptor?

(c) How are several descriptors in a description to be related?

In dealing with the first of these questions in particular, one can
examine methods of descriptor selection that operate through improvement
of an initial set on the basis of experience with the retrieval system.

In the approaches considered here it has been assumed that
descriptions consist of Boolean combinations of descriptors, and the possibility
of  attaching  probabilities  to  the  descriptors  has  been  explicitly  admitted. 
The  major  task  is  then  the  selection  of  the  particular  descriptors  to  be 
used.  This  section  discusses  the  role  of  efficiency  in  descriptor  selec¬ 
tion  and  some  corrective  methods  for  improving  a  descriptor  set  under 
actual  operating  conditions.  In  Appendix  C,  Section  9.3,  some  of  the 
more  popular  existing  descriptive  schemes  are  described  and  discussed. 

4.2.1  Efficiency  Considerations  in  Descriptor  Selection 

4.2.1.1 General Criteria - In a collection of n items there is
only a finite number of subcollections of items that are theoretically
possible responses in item retrieval systems. The number is 2^n if zero
items are considered a subcollection. In practice, not all 2^n answers
are equally likely to be searched for by a user. Intuition suggests that
this disparity is an essential criterion for the effective design of a
query or descriptor language.
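The count of 2^n possible answers can be checked by enumeration for a small collection. The item names below are placeholders, and the figure for 40 items is included only to show how quickly the number grows.

```python
from itertools import combinations

def subcollections(items):
    """Yield every subcollection of items, including the empty one."""
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)

items = ["doc1", "doc2", "doc3"]
answers = list(subcollections(items))
print(len(answers))  # 8, i.e. 2**3
print(2 ** 40)       # 1099511627776 possible answers for only 40 items
```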

There are several possible approaches to specifying which of
these 2^n subcollections is being referenced. In one sense the simplest
means of specification is to assign a name or descriptor to each of the
n items in the collection. In the case when all 2^n subcollections are
requested equally often and when the questioner knows the name of each
item he is interested in, this method produces an adequate system. If,
however, some subcollections are considerably more popular than others,
then an obvious improvement in coding efficiency would result from giving
popular collections special category names.
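The gain in coding efficiency can be illustrated with Shannon entropy, used here as one standard information theoretic measure; the report does not prescribe this particular calculation, and the request probabilities below are invented. With equal demand for all 2^n subcollections, n bits per request are needed; with skewed demand, the average can be lower.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy: a lower bound on average bits per request."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

n = 3
count = 2 ** n                       # 8 possible subcollections
uniform = [1 / count] * count        # every answer equally popular
skewed = [0.5, 0.25] + [0.25 / 6] * 6  # two popular answers, six rare ones

print(entropy_bits(uniform))           # 3.0 bits, one bit per item
print(round(entropy_bits(skewed), 2))  # about 2.15 bits
```

Short special names for the two popular subcollections would approach the lower figure, which is the improvement referred to above.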

There are, however, considerations other than information theoretic
measures of coding efficiency that are relevant to the selection of a
descriptor language. Asking for all the items in a subcollection by name
is possible only when the names of all the documents in the subcollection
that are of interest are known. Under these circumstances the general
problem of information retrieval becomes a special case, and only
considerations of coding efficiency and, perhaps, user compatibility are relevant
criteria for descriptor language design.

In an ordinary library search the questioner does not know the
names of the items he needs. He wants the system to supply a
subcollection of items that will provide information relevant to his query after
he reads them. The system must go from his query or a transformation of
his query to an appropriate subcollection of items, even though the user
does not know in advance what is in this subcollection.

How can the system do this? One approach is to ask, perhaps
implicitly, questions in advance and to search, again implicitly, the
entire collection to find the items that contain information relevant
to each question. The system would then have the stored answer available
whenever the same question arose. In a sizable collection it is not
feasible to ask all questions in advance. There are two reasons: first,
there are a large number of ways of asking essentially the same question;
another way of putting this point is that the same answer subcollection
would satisfy many possible question variations. Second, there are too
many possible answers (specifically, 2^n) in any sizable system.

Each of these difficulties requires a different approach. The
approach to the former involves standardization; that is, the possible
ways of asking essentially the same question must be restricted. This
solution is primarily a language problem. The approach to the latter
difficulty involves exclusion of less probable questions and their
resultant answers from advance treatment. This solution is primarily a
system design and organization problem.

How is explicit or implicit advance treatment of questions
possible? One theoretically possible method would be to have all
documents in the library unordered, except perhaps by author and title, for
those searches in which the querier already knows which documents he
wants. Anyone wishing to use the library could then be asked to submit
both a copy of his question and a list of the documents he found relevant
after making his search of the library. This information could then be
stored for occasions when the same or similar questions are asked.

Of course, this scheme is impractical, but listing some of its
inherent difficulties may lead to an understanding of the requirements
of an ideal descriptor-query language.

(a) There is no assurance that any initial questioner will do a good
or thorough job in searching all the documents in the library.

(b) Even if the initial questioner has done a perfect job at the
time he searched the library, there would be a lack of
information about the relevance of new accessions to the question.
Of course, new accessions could be re-searched by subsequent
questioners in order to keep the answer list up to date.

(c) Many questions will recur imprecisely; even if the
statement of the question is identical, different users are likely
to have different meanings or intentions that would influence
which documents they considered appropriate for the answer list.
Thus, even if there is a perfect and up-to-date search performed
by the initial questioner, it is not likely to be perfect for a
subsequent questioner.

(d) Such a system would impose an unacceptable search burden, not
only upon initial questioners but also upon subsequent
questioners, if there are a substantial number of new acquisitions.
Furthermore, the askers of somewhat unusual questions would
always tend to be in the role of initial questioners,
regardless of how long the system has been in operation. Their
extensive search efforts would rarely be applied by subsequent
users.

The technique currently used by most libraries, in order to deal with
these objections, is implicitly to select a range of questions to be
pre-answered and then to assess the relevance of each accession (i.e.,
index it) to all these questions as it is entered into the library file.
To the extent that a document's relevance to many questions can be
assessed nearly simultaneously, this technique has obvious advantages
over repeatedly scanning each document for each question in some sequence
of questions.
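Indexing at the time of accession can be sketched as follows. The pre-selected questions and the keyword tests that stand in for a librarian's relevance judgment are assumptions for illustration; the point shown is only that each document is examined once, on entry, against the whole question list.

```python
# Hypothetical pre-selected questions, each paired with a crude relevance
# test. A keyword test stands in for the indexer's judgment.
QUESTIONS = {
    "radar": lambda text: "radar" in text,
    "antenna design": lambda text: "antenna" in text,
}

# Stored answers: question -> list of relevant document identifiers.
index = {q: [] for q in QUESTIONS}

def accession(doc_id, text):
    """Assess a new document against every pre-selected question once."""
    text = text.lower()
    for question, is_relevant in QUESTIONS.items():
        if is_relevant(text):
            index[question].append(doc_id)

accession("D1", "Radar antenna measurements")
accession("D2", "Budget summary")
accession("D3", "Antenna arrays")

print(index["radar"])           # ['D1']
print(index["antenna design"])  # ['D1', 'D3']
```

Answering a pre-selected question is then a lookup in `index`, with no scan of the documents themselves.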

The  approach  of  classifying  each  accession  for  all  questions 
will  deal  completely  only  with  difficulties  (a)  and  (b).  Difficulties 
(c) and (d) will be resolved only to the extent that the question list,
against  which  each  document  is  implicitly  being  checked,  is  sufficiently 
extensive  and  to  the  extent  that  the  meaning  of  these  implicit  questions 
is  sufficiently  clear  to  the  system  users. 

It is likely that none of the difficulties will ever be resolved
completely. Even a user searching on the basis of his own question is
likely to introduce inadvertent errors of both inclusion and exclusion
on the answer list if he is scanning a large file collection. Similar
errors will occur when a librarian classifies a book. But additional
errors will result from the fact that the meaning of the implicit
questions reflected by the classification varies from person to person.

These  errors,  while  often  significant,  are  not  as  basic  a  prob¬ 
lem  as  the  limitation  on  possible  questions  that  can  be  answered.  These 
limitations  are  a  necessary  concomitant  of  indexing  a  large  collection. 

As  has  already  been  suggested,  there  are  two  kinds  of  limitations: 

(a) Basic limitations on the retrieval of all 2^n answers. In general,
no indexing scheme for a sizable collection is sufficiently
articulated to allow retrieval of all possible answers without knowing
the names of individual documents.

(b) Secondary limitations on the acceptability, or communicability,
of a specific question formulation that does in fact correspond
to one of the accessible answers.

The latter limitation does not necessarily imply any change in
the logical organization of the indexing or query-descriptor language.
The problem is one of using appropriate names or labels for the index
terms or combinations of index terms that correspond to those of the 2^n
answers that the system is capable of generating. Of course, the problem
is not one that can be solved merely by the judicious selection of terms.
It is necessary that the questioner and the library system use these terms
in essentially the same sense. Furthermore, it is necessary that alternate
descriptions of the same answer or question be interconvertible, either by
the library system or by the user. To date, the only methods of dealing
with this problem have been to provide the user with a dictionary-type
description of the index terms, an overview of the relationship among
the terms used by the system, and/or a thesaurus type of referral ("see"
and "see also") to related terms.

The problem of converting synonymous descriptions probably
cannot be approached by considering the relative frequency of subcollection
questions. Of course, the more popular a subcollection, the more valuable
it might be to be able to deal with alternate ways of describing it. The
problem of unaskable questions, however, can only be approached fruitfully
from this point of view. If the system is to be insufficiently articulated
for the retrieval of all 2^n possible answer collections, it seems that the
criteria (other than random exclusion based upon cost considerations) for
deciding which subcollections are to be retrievable should ultimately be
based upon the frequency of user demand. Only those questions that will
rarely or never be asked should in principle be unanswerable (without
searching the entire collection) because of limitations in the query
language and the accompanying file structures and search procedures.

This conclusion suggests that a second consideration, besides
the relative frequency of user demand for various possible answers, may
be important. This consideration is the absolute level of demand for a
possible answer subcollection. The absolute level of demand is readily
calculated from estimates of relative demand and the total number of
questions asked. An estimate for the number of questions may be the
length of time for which the collection of items will be used multiplied
by levels of use such as questions per day during this interval. As
absolute use of the system as a whole increases, more articulate
indexing becomes necessary to include the relatively less frequently asked
questions, which now are asked a significant number of times in the
system's lifetime.
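The calculation of absolute demand amounts to simple arithmetic, sketched below. The relative frequency, the period of use, the levels of use, and the 365-day year are all invented figures chosen only to show the effect of heavier use.

```python
def absolute_demand(relative_frequency, years_in_use, questions_per_day,
                    days_per_year=365):
    """Expected number of times an answer is requested in the system's lifetime.

    days_per_year is an assumed convention, not a figure from the report.
    """
    total_questions = years_in_use * days_per_year * questions_per_day
    return relative_frequency * total_questions

# The same rare answer (1 request in 10,000) under light and heavy use.
light = absolute_demand(0.0001, years_in_use=10, questions_per_day=5)
heavy = absolute_demand(0.0001, years_in_use=10, questions_per_day=500)
print(round(light, 3))  # 1.825 requests over the lifetime
print(round(heavy, 1))  # 182.5 requests over the lifetime
```

An answer requested fewer than twice in ten years hardly justifies special indexing; the same answer requested some 180 times probably does.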

Answer subcollections should not merely be regarded as accessible
or inaccessible with a given query capability. Even if a subcollection
is not immediately accessible, there are degrees of desirability that
can be discriminated with respect to its inaccessibility. Thus a
desired answer subcollection may not be directly accessible per se, yet
it may be wholly embedded in another subcollection that is accessible
and that contains few additional items. Clearly, there is no great
deficiency in query capability under such circumstances so long as the
user can identify and ask for the appropriate inexact subcollection. If,
however, the items in a desired inaccessible subcollection are widely
scattered (that is, the items cannot be obtained without searching a
number of accessible subcollections) the situation is quite different.
This difficulty is likely to be further complicated by the inherent
unavailability of information about which accessible subcollections
contain the items the user needs. Under such circumstances the user
may be reduced to searching the entire collection, or unacceptably large
parts of it, in order to obtain the needed information. It may be
fruitful to develop rigorous measures of degree of inaccessibility based
upon minimal and/or maximal false drops and/or misses.

Such a measure of accessibility could be used to evaluate the
goodness of any descriptor scheme for any item collection. More precisely,
it could be used to measure the average inaccessibility for the power
set of items, the set of 2^n possible answers, for a given descriptor
scheme. When combined with information about relative frequencies of
the members of the possible answer set, such a measure can provide
information about the average accessibility of items per request. One purpose
of a general theory of information retrieval is to provide an analytical
framework in which this quantity, the average accessibility per request,
can be optimized, given a context of relevant system parameters.
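One possible form of such a measure is sketched below. The definition used here, the smallest total of false drops and misses over all accessible answers, is an assumption chosen for illustration and is not a measure defined in this report; the accessible sets and request frequencies are likewise invented.

```python
def inaccessibility(target, accessible):
    """Fewest errors (false drops plus misses) over the accessible answers."""
    return min(len(a - target) + len(target - a) for a in accessible)

def average_inaccessibility(requests, accessible):
    """Frequency-weighted average; requests is a list of (target, frequency)."""
    return sum(freq * inaccessibility(t, accessible) for t, freq in requests)

# Hypothetical answers the descriptor scheme can produce directly.
accessible = [frozenset(), frozenset({1, 2}), frozenset({1, 2, 3}),
              frozenset({4})]

requests = [
    (frozenset({1, 2}), 0.6),  # directly accessible: no errors
    (frozenset({1, 3}), 0.3),  # embedded in {1, 2, 3}: one false drop
    (frozenset({2, 4}), 0.1),  # scattered: best single answer misses one item
]

print(inaccessibility(frozenset({1, 3}), accessible))             # 1
print(round(average_inaccessibility(requests, accessible), 2))    # 0.4
```

Under this definition a directly accessible answer scores zero, and a descriptor scheme could be compared with another by the weighted average over the expected requests.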

4.2.1.2 Factors That Govern the Criteria of Relative Importance
of Descriptors - If a large collection of documents is classified in some
fashion, each document in this collection is labeled by one or more
descriptors. However, not all descriptors are equally important; the
deletion of some would hardly affect the retrieval processes, but the
deletion of others would be detrimental.

The unchecked proliferation of descriptors may diminish the
usefulness of a collection of documents either by lengthening the
physical processes involved in retrieval, by confusing the taxonomical
logic of the collection, or by simply straying too far from the natural
usage of terms. In any case, it is usually prudent to restrict the
number of new descriptors that may be introduced in order to keep the
retrieval processes near peak efficiency.

Under such conditions the choice and the allocation of
descriptors may be governed by criteria of descriptor importance. In addition,
the criteria used in automatic indexing procedures may necessarily lean
more towards the use of statistical information about the collection
than is the case when indexing is done manually. To put the same ideas
differently and more strikingly, when indexing is performed automatically
the governing criteria may pertain more to statistical distributions of
descriptors among the documents than to explicit relations between the
subject matter of a given document and a descriptor.

Given  these  premises,  the  factors  that  govern  the  relative 
importance  of  descriptors  are: 

(a) Let us suppose that a certain descriptor is never mentioned in
any of the retrieval requests. Obviously such a descriptor
could be deleted from the collection without loss. Conversely,
descriptors used with high frequencies have a high probability
of being important. At present, we can only speak of the
higher probability of importance since the relation of various
factors to each other has not been formalized. So far as
frequency relations are concerned, a certain asymmetrical
situation exists. Below a certain frequency threshold the
frequency considerations are overwhelming. If a descriptor
is not used with a certain minimum frequency, it cannot be
ranked high. However, the high frequency descriptors are not
necessarily important. For example, a high frequency descriptor
may be synonymous with another descriptor.

(b) Descriptors are usually employed jointly. The importance of
a descriptor is influenced by "the company it keeps." A
descriptor may have little "actual discriminatory power"
vis-a-vis descriptors that co-occur in a representative
retrieval request. For example, let us assume that a certain
descriptor, say D, is used jointly with descriptors:

A1 A2 A3 A4

B1 B2 B3 B4

and

C1 C2 C3 C4

Let us assume that the increment of the retrieval collection
due to the deletion of D is in each of the cases from 498
documents to 500 documents. The average "actual discriminatory
power" of the D-descriptor is low.

(c) The average number of descriptors used in retrieval calls
containing a given descriptor is an important indicator of the
order of importance. Other things being equal, one may expect
that a descriptor that co-occurs with large numbers of other
descriptors in retrieval requests is of lesser importance than
one that co-occurs with few, since the absolute number of
documents excluded by the descriptor in the latter case is greater.
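The increment of the retrieved collection on deleting a descriptor from a request, which underlies the notion of actual discriminatory power in factor (b), can be computed as in the sketch below. The descriptor file is invented so as to reproduce the 498 and 500 document figures of the example; only a conjunctive search is assumed.

```python
# Hypothetical descriptor file: descriptor -> set of document numbers.
FILE = {
    "A": set(range(0, 600)),
    "B": set(range(100, 600)),
    "D": set(range(0, 598)),
}

def retrieve(request):
    """Conjunctive search over the descriptors of a request."""
    return set.intersection(*(FILE[d] for d in request))

def increment_on_deletion(request, descriptor):
    """Extra documents retrieved when descriptor is dropped from request."""
    reduced = [d for d in request if d != descriptor]
    return len(retrieve(reduced)) - len(retrieve(request))

request = ["A", "B", "D"]
print(len(retrieve(request)))               # 498
print(increment_on_deletion(request, "D"))  # 2
```

Averaging this increment over the requests actually received would give one estimate of a descriptor's actual discriminatory power; here D excludes only 2 of 500 documents.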

These considerations dealt with descriptors as used in retrieval
and have to do with the actual usage of descriptors. The next set of
factors, by contrast, pertains to potential usage and is not directly
related to actual usage. These factors are dependent only upon the
distribution of descriptors among documents and not upon their
occurrence in retrieval calls:

(d) The larger the size of a document set spanned by a descriptor,
the greater will be its ranking on the importance scale.

(e) Corresponding to the "actual discriminatory power" of a
descriptor there is the "potential discriminatory power."
The potential discriminatory power of a descriptor measures
the uniqueness of its coverage. It is computed in the same
way as the actual discriminatory power, except that the
descriptor combinations to be considered are not derived from
usage statistics. A descriptor will have potential discriminatory
power of zero, if any retrieval request involving that descriptor
can be replaced by a different request not involving that
descriptor with no change in the set of retrieved documents.
On the other hand, if many sets of documents can be retrieved
only as small subsets of other retrievable sets when the
descriptor in question is not used, then the descriptor has
high potential discriminatory power.

(f) A set spanned by a descriptor may intersect sets spanned by
closely related descriptors or sets spanned by descriptors
remote from one another. Such characteristics may be called
a measure of dispersion of a descriptor. Other things being
equal, the more dispersed a descriptor is, the less highly will
it rank. This is so because with high dispersion a higher
proportion of the documents retrieved in any particular
retrieval call may be expected to be only marginally relevant to
the request.

4.2.1.3  Statistical Data Required for the Determination of the
Order of Descriptor Importance - Unfortunately not all the factors
mentioned in the preceding section can be conveniently measured. For some
factors the amount of bookkeeping required is close to astronomical.
Therefore, one must take recourse to convenient substitutes that
encapsulate the essential information without too much leakage and, at the
same time, reduce the requisite amount of data handling and bookkeeping.

The important consideration that has to be kept in mind is that
detailed accounts of intradescriptor relationships cannot be kept. For
example, with 10,000 descriptors there are $2^{10,000}$ possible combinations
of descriptors, and if even .01 percent of these are active (i.e., there
are some documents that are indexed by them) the number of entries that
would have to be retained is astronomical. It is only necessary,
therefore, to keep track of selective data on the basis of which the
important intradescriptor relationships could be approximately reconstructed.
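
The scale of the bookkeeping problem can be checked with a few lines of arithmetic. The sketch below is illustrative only; it uses the 10,000-descriptor vocabulary and the .01 percent activity rate quoted above, and the variable names are hypothetical.

```python
# Illustrative arithmetic only: figures are those quoted in the text.
n_descriptors = 10_000
combinations = 2 ** n_descriptors      # every subset of the descriptor set
active = combinations // 10_000        # .01 percent is 1 part in 10,000

# Count decimal digits rather than printing the numbers themselves.
digits_total = len(str(combinations))
digits_active = len(str(active))
print(digits_total, digits_active)
```

Even the "active" fraction is a number several thousand digits long, which is why only selective summary statistics can be retained.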

The most difficult problem will consist of trying to reconstruct
the "dispersion" and the "discriminatory power" of the descriptor set.
Tentatively, the following set of parameters is suggested as a basis for
further study:

(a) Total document span of individual descriptors.

(b) Frequency of recall of individual descriptors.

(c) The number of documents spanned by a given descriptor in
company with k other descriptors, where k is 1, 2, ..., a - 1.

(d) The document span of an average descriptor contained in a set
of k of them present with a given descriptor.

(e) The frequency of recall of an average descriptor contained in
a set of k of them present with any given descriptor.

(f) The number giving an overlap measure of an average descriptor
contained in a set of k descriptors present with a given
descriptor.
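
The first three parameters can be tallied in a single pass over the index and the request log. The following is a minimal sketch under stated assumptions: "frequency of recall" is taken to mean the number of retrieval requests that use a descriptor, and parameter (c) is taken to count documents indexed by a descriptor together with exactly k others. The function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def descriptor_statistics(index, requests):
    """index: dict doc_id -> set of descriptors; requests: list of sets."""
    span = Counter()                  # (a) documents spanned by each descriptor
    recall = Counter()                # (b) requests that use each descriptor
    company = defaultdict(Counter)    # (c) company[d][k]: docs where d has k companions
    for descriptors in index.values():
        k = len(descriptors) - 1      # number of other descriptors on this document
        for d in descriptors:
            span[d] += 1
            company[d][k] += 1
    for request in requests:
        recall.update(request)
    return span, recall, company

# Toy data (assumed for illustration).
index = {
    "D1": {"radar", "antenna"},
    "D2": {"radar"},
    "D3": {"radar", "antenna", "filter"},
}
requests = [{"radar"}, {"radar", "filter"}]
span, recall, company = descriptor_statistics(index, requests)
```

Parameters (d) through (f) are averages over the same tallies and could be derived from them without storing individual descriptor combinations.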

4.2.1.4  Summary - In a collection of n documents, there are
$2^n$ possible subcollections if the empty collection is included. In
practice, not all $2^n$ subcollections are equally likely to be searched
for by a user. Any descriptive scheme should be based on this fact and
designed in such a way that useful subcollections have simple descriptions.
A query to the system selects a particular subcollection. With each
subcollection a measure of accessibility can be associated; this measure
indicates the complexity of the query required to retrieve the
subcollection. Certain statistical measures have been presented that could
be used to measure the value of a descriptor in constructing descriptions;
specifically, the concepts of "dispersion" and "discriminatory power" are
defined. Some of the data that would be useful in computing these
measures have also been described.

4.2.2  Corrective  Procedures  for  Indexing  Systems 

4.2.2.1  General - This section investigates the methods and
feasibility of applying corrective procedures to indexing systems. A
fundamental aspect of these concepts is their ultimate adaptability to
automated procedures. The first part of this discussion presents the
basic ideas of this concept; the second part develops the concept
formally.

4.2.2.2  The Taxonomy of Indexing Systems - Information retrieval
systems consist of a collection of documents and a set of indexing rules
and procedures for linking descriptors to documents. The documents in
this context refer to the smallest ensemble of information subject to
retrieval; these documents are considered as being indivisible. The
indexing rules and procedures theoretically select descriptors that bear
some relation to the descriptors used by people who will interrogate the
system.


The system may accept new documents; the documents are then
classified according to the rules and procedures of the indexing scheme
of the system. The system is not necessarily committed to the use of
old descriptors. The indexing rules allow for the supply of new
descriptors with the acceptance of the new documents by the library.

The user specifies his requests for information by writing a
sequence of acceptable descriptors in the form of a Boolean function;
that is, the descriptors are joined by OR and AND. The user's
disposition of the descriptors implies the existence of an ideal taxonomic
system. The taxonomy imposed by the indexing rules and procedures
constitutes an external taxonomy or a priori taxonomy.

A  corrective  procedure  will  cause  the  external  taxonomy  to 
evolve  into  the  ideal  taxonomy  on  the  basis  of  information  concerning 
the  adequacy  of  the  sets  of  documents  retrieved.  This  information  is 
supplied  by  the  user. 

The central problem is: On what factors does the functioning
of a corrective procedure depend? The answer to this problem depends
upon the elucidation of the relation between the ideal and the external
taxonomy. More specifically, the hypothesis depends upon the concept of
invariance. Invariance pertains to the a priori postulated constancy
between descriptors in the two taxonomies.

This discussion, then, will advance the hypothesis that:

(a) The concept of relatedness of descriptors can be made numerically
precise.

(b) The concept of relatedness can serve as a building block for
more complex relationships between descriptors.

(c) Some such relationships are postulated as being constant;
i.e., these relationships remain invariant in both the
external and the ideal taxonomies.

(d) The existence of such constancies forms the basis for selecting
rules for reassigning descriptors among documents.

The remainder of this section will attempt to validate this hypothesis
and describe the resultant consequences.

4.2.2.3  Formalization of the Hypothesis - Let $d_1, d_2, \ldots, d_m$
and $D_1, D_2, \ldots, D_n$ be descriptors and documents, respectively. For
every descriptor there corresponds a class of documents spanned by this
descriptor. In set-theoretic notation this concept becomes:

$[D: d(D) = d_i(D)]$   (1)

which may be read as "the set of all documents such that descriptor $d_i$
applies to the set." To avoid cumbersome notation, the abbreviation
$[D(d_i)]$ will be used to represent the set. The number of documents
contained in such a set will be denoted by M. Then $M[D(d_i)]$ stands for
the number of documents contained in the set spanned by the descriptor $d_i$.

In general, to every Boolean function of descriptors there
corresponds a set of documents spanned by these descriptors. Therefore,
"the set of all documents that are indexed by B(d)" becomes:

$[D(B(d))]$   (2)

For example,

$[D(d_1 \wedge (d_2 \vee d_3))]$   (3)

is the set of all documents that have as their indices the descriptors
$d_1$ and $d_2$ or $d_3$ or both, among others. It is clear that the
following relation holds:


$[D(B(d))] = B[(D(d))]$   (4)

This expression signifies that the set of all documents spanned by a
Boolean function of descriptors is equivalent to the Boolean function
of sets spanned by these descriptors. By analogy, the expression $[d(B(D))]$
represents a set of predicates contained in the set of documents described
by the Boolean function B(D).
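
Relation (4) can be checked on a small example. The sketch below is an illustration rather than part of the formal development; it evaluates the Boolean function of expression (3) once document by document and once as set operations on the spans, and the toy index and the helper name `span` are assumptions.

```python
def span(index, descriptor):
    """Set of documents indexed by the given descriptor, i.e. [D(d)]."""
    return {doc for doc, descriptors in index.items() if descriptor in descriptors}

# Toy index (assumed): document -> descriptors assigned to it.
index = {
    "D1": {"d1", "d2"},
    "D2": {"d1", "d3", "d4"},
    "D3": {"d1"},
    "D4": {"d2", "d3"},
}

# Left side of (4): apply d1 AND (d2 OR d3) to each document's descriptors.
lhs = {doc for doc, ds in index.items()
       if "d1" in ds and ("d2" in ds or "d3" in ds)}

# Right side of (4): apply the same Boolean function to the spanned sets,
# with AND as intersection and OR as union.
rhs = span(index, "d1") & (span(index, "d2") | span(index, "d3"))

print(sorted(lhs), sorted(rhs))
```

Both sides select D1 and D2; D3 lacks $d_2$ and $d_3$, and D4 lacks $d_1$.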


The relatedness of descriptors or their Boolean functions is
defined as the number of documents contained in the intersection of
classes spanned by these descriptors or their Boolean functions divided
by the number of documents spanned by the union. Formally, this
definition becomes:

$R_d[B_i(d), B_j(d)] = \frac{M[B_i(D(d)) \wedge B_j(D(d))]}{M[B_i(D(d)) \vee B_j(D(d))]}$   (Definition 1)   (5)


A similar concept of the relatedness of documents or their Boolean
functions is defined analogously:

$R_D[B_i(D), B_j(D)] = \frac{M[B_i(d(D)) \wedge B_j(d(D))]}{M[B_i(d(D)) \vee B_j(d(D))]}$   (Definition 2)   (6)


It is important to note that throughout this discussion the concepts for
descriptors can be analogously applied to documents. The subsequent
development, however, will be limited to the relatedness of descriptors.
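
For single descriptors, the relatedness measure is the ratio of the size of the intersection of the two spanned classes to the size of their union. A minimal sketch follows; the function name, the handling of two empty classes, and the example spans are assumptions made for illustration.

```python
def relatedness(class_a, class_b):
    """Documents in the intersection divided by documents in the union."""
    union = class_a | class_b
    if not union:
        return 0.0   # assumption: two empty classes are treated as unrelated
    return len(class_a & class_b) / len(union)

# Example spans (assumed): document numbers indexed by each descriptor.
radar = {1, 2, 3}
antenna = {2, 3, 4}
filters = {7, 8}

r_close = relatedness(radar, antenna)   # 2 shared documents out of 4
r_far = relatedness(radar, filters)     # no shared documents
print(r_close, r_far)
```

The measure ranges from 0 for disjoint classes to 1 for identical classes, and it is symmetric in its two arguments.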


Since the external taxonomy by hypothesis does not precisely
correspond to the ideal taxonomy, the distinct symbol, $\delta$, is introduced
to represent the descriptors of the user. These descriptors are only
different insofar as they index classes of documents that are not identical
with the classes of documents indexed by the descriptors of the external
taxonomy. Thus for any descriptor or index i, $[d_i(D)]$ and $[\delta_i(D)]$ are
not necessarily identical, even though the descriptors themselves may be
the same. The objective of corrective procedures is to adjust the
application of descriptors to documents so that the two sets become
identical. The corrective procedures may have fulfilled their task if the
objective is approximated to the extent that any divergence has a
negligible impact upon the user.

4.2.2.4  The Basis of Corrective Procedures - Assume that all
retrieval requests consist of single descriptors. The user formulates
his request in terms of a descriptor related to the ideal taxonomy.
The system retrieves all documents spanned by this descriptor, except
that this descriptor is $d_i$ in the external taxonomy. The user then
decides whether the retrieved collection of documents is satisfactory.
The collection may not satisfactorily fulfill the user's requirements
for three reasons:

(a) Too many documents were collected.

(b) Too few documents were collected.

(c) Some documents are superfluous and some are missing.

The corrective procedures should select documents more in consonance
with the user's needs and then effect permanent changes in the application
of descriptors to documents.

If the system retrieves too many documents, the system may
select a set of descriptors that are most related to the user's descriptor
and then remove from the retrieved set those documents spanned by the
related descriptors. This method conceals a difficulty. Although a
measure for relatedness of two descriptors has been defined, no
technique has yet been specified to select clusters of most related
descriptors.

If the system retrieves too few documents, a set of descriptors
most closely related to the given descriptor is assembled; the set may
be limited to a single descriptor. A Boolean function of those
descriptors is then constructed, and documents spanned by the Boolean
function are retrieved. The factors that determine the nature of the
particular Boolean function of descriptors must still be defined.

If some documents are superfluous and some are missing, the
problem may be handled as a combination of the specific problems of too
many or too few documents. More realistically, however, some problems
of this type are sui generis, and specific solutions must be developed.
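
The first two cases can be sketched in code. Because the text leaves open both the technique for selecting clusters of related descriptors and the form of the Boolean function, the sketch below makes two loud assumptions: the related descriptors are simply the `count` descriptors with the highest pairwise relatedness, and the Boolean function used for broadening is a plain OR. All names and data are hypothetical.

```python
def relatedness(a, b):
    """Size of intersection over size of union of two document classes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_related(spans, target, count=1):
    """Assumed selection rule: top `count` descriptors by pairwise relatedness."""
    others = [d for d in spans if d != target]
    others.sort(key=lambda d: relatedness(spans[target], spans[d]), reverse=True)
    return others[:count]

def narrow(spans, target, count=1):
    """Too many documents: remove those spanned by the related descriptors."""
    result = set(spans[target])
    for d in most_related(spans, target, count):
        result -= spans[d]
    return result

def broaden(spans, target, count=1):
    """Too few documents: OR the target with its related descriptors (assumed)."""
    result = set(spans[target])
    for d in most_related(spans, target, count):
        result |= spans[d]
    return result

# Toy spans (assumed): descriptor -> documents it indexes.
spans = {"radar": {1, 2, 3, 4}, "antenna": {3, 4, 5}, "filter": {6}}
narrowed = narrow(spans, "radar")
broadened = broaden(spans, "radar")
print(sorted(narrowed), sorted(broadened))
```

Here "antenna" is the descriptor most related to "radar", so narrowing drops documents 3 and 4 and broadening adds document 5. Neither function alters the index; the permanent reassignment of descriptors is a separate step.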

After documents have been deleted from or added to the originally
inadequate set, the corrective procedure must effect
permanent changes in the extension of some descriptors so that the
denotations of the external and ideal descriptors approach equivalence.
The problem is to render the sets $[\delta_i(D)]$ and $[d_i(D)]$ extensionally as
similar as possible. Several corrective procedures may be used:

(a) To affix the user's descriptor to all the documents and only
those documents in the acceptable retrieved set.

(b) To delete or add some descriptors selectively from the set of
documents spanned after the process of deletion or augmentation.

(c) To delete or add some descriptors selectively to the documents
that were deleted or complemented from the originally inadequate
retrieved set.

(d) To effect other descriptor changes on the documents not affected
by the processes of complementation or deletion.

The first procedure by itself will not produce the desired
transformation until all descriptors have been used in retrieval processes
at least once. This prospect is uninviting for any document collection with
a large number of descriptors. If such a procedure were feasible, there
would be no reason not to index the entire collection in the ideal
taxonomy in the first place. In addition, the procedure of complementing
the original set of documents need not necessarily lead to the formation
of a taxonomy whose extension is identical to the ideal. Rather, the
process may only be an approximation; that is, a set obtained after a
series of complementations may only approximate the ideal taxonomy.

A closer look at the remaining three procedures and their
inherent problems is necessary. Consider a class of documents $[D(d_i)]$
spanned by descriptor $d_i$. Suppose that the user requests all documents
under the descriptor $\delta_i$, a descriptor corresponding to $d_i$. The class
$[D(d_i)]$ is retrieved; it does not fulfill the user's requirements. The
complementation procedure results in formation of a new class $[D'(d_i)]$.
The corrective procedure should then implement changes pertaining to the
distribution of the remaining descriptors among documents. How should
these changes be made? Or, to rephrase this question, on what should
the inferential processes be based in order to ensure that the ideal
taxonomy is approximated?


Assume that there is no relation between the external and the
ideal taxonomies. In this case the first stage of the corrective
procedure (that is, the complementation of the selected set) must proceed
at random. If the taxonomy imposed upon the collection of documents is
not correlated with the taxonomy implied by the user, then the relatedness
of descriptors to one another will be of no help either in reassigning
descriptors or in complementing the original sets.

The possibility of developing corrective procedures depends,
therefore, upon some a priori relation between the two taxonomic systems.
If such a relationship exists, then it must be expressible in terms of the
concept of relatedness. The relatedness of descriptors in one system
must resemble the relatedness in the other. The concept of a relatedness
between two taxonomic systems isolates the particular invariance that
characterizes the sets of documents designated by certain descriptors.
Formally, an invariance exists if $d_i R d_j$ is true whenever
$\delta_i R \delta_j$ is true, where R is a relationship between descriptors.
There need not be some universal type of invariance present whenever there
is a resemblance between two taxonomic systems. On the contrary, depending
upon the nature of the data to be retrieved, the invariance between the
ideal and the external taxonomy may differ.

Some formal examples may clarify the concept of invariance.
First, if a set of documents spanned by a descriptor in one system
contains another set of documents spanned by another descriptor and if this
condition implies the same condition for the corresponding descriptors
in the other system, then the invariance might be called nested invariance.
Formally:

$[D(d_i)] \supset [D(d_k)] \rightarrow [D(\delta_i)] \supset [D(\delta_k)]$   (7)

where $\rightarrow$ indicates "implies," and $\supset$ indicates set inclusion.
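
A check for nested invariance between two small taxonomies might look as follows. This is a sketch under assumptions: both taxonomies are represented as dictionaries that map the same descriptor names to document sets, inclusion is taken as non-strict, and the function name and data are hypothetical.

```python
def nested_invariance_holds(external, ideal):
    """True if every inclusion between external spans also holds in the ideal.

    external, ideal: dict descriptor -> set of documents, with the same keys.
    Inclusion is tested with >= (superset or equal), an assumed reading.
    """
    for i in external:
        for k in external:
            if i == k:
                continue
            if external[i] >= external[k] and not ideal[i] >= ideal[k]:
                return False
    return True

# Toy taxonomies (assumed).
external = {"vehicle": {1, 2, 3}, "truck": {2, 3}}
ideal_ok = {"vehicle": {1, 2, 3, 4}, "truck": {3, 4}}     # nesting preserved
ideal_bad = {"vehicle": {1, 2}, "truck": {2, 5}}          # nesting broken

print(nested_invariance_holds(external, ideal_ok),
      nested_invariance_holds(external, ideal_bad))
```

The classes themselves differ between the two taxonomies in both cases; only the inclusion relation between them is required to carry over.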

In a second example the most closely related descriptors in one
system are also most closely related in another. To represent this type
of invariance formally, let $(d_i, d_j)^*$ be an ordered pair of descriptors
that are related to each other as follows:

$R_d[(d_i), (d_j)] = \mathrm{MAX}\; R_d[(d_i), (d_k)]$   (for all k)   (8)

If then $(d_i, d_j)^* \rightarrow (\delta_i, \delta_j)^*$, the relationship of
being most closely related is preserved.

The third example replaces MAX by MIN to obtain an invariance
of being the least closely related descriptor. In spite of the formal
similarity between the most and least closely related conditions, there
is a formidable practical difference. The most closely related
condition preserves an invariance between a descriptor and a descriptor; the
least closely related condition preserves an invariance between a
descriptor and a class of descriptors.
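
The most closely related condition can be tested in the same style. The sketch below assumes that both taxonomies use the same descriptor names, that relatedness is the intersection-over-union ratio, and that ties are broken by dictionary order (the text does not say how ties should be treated). Names and data are hypothetical.

```python
def relatedness(a, b):
    """Size of intersection over size of union of two document classes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def closest(spans, target):
    """Descriptor most related to `target`; ties go to the first one found."""
    candidates = [d for d in spans if d != target]
    return max(candidates, key=lambda d: relatedness(spans[target], spans[d]))

def most_related_invariance_holds(external, ideal):
    """True if each descriptor has the same closest neighbor in both systems."""
    return all(closest(external, d) == closest(ideal, d) for d in external)

# Toy taxonomies (assumed); no ties occur in these examples.
external = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {4, 5}}
ideal_ok = {"a": {1, 2, 3, 6}, "b": {2, 3, 4}, "c": {4, 5, 7}}
ideal_bad = {"a": {1, 5}, "b": {2, 3, 4}, "c": {4, 5}}

print(most_related_invariance_holds(external, ideal_ok),
      most_related_invariance_holds(external, ideal_bad))
```

Replacing `max` with `min` in `closest` gives the least closely related variant, where zero-relatedness ties are common; this is one way to see why that condition relates a descriptor to a class of descriptors rather than to a single one.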

As a fourth example the concept of most closely related
descriptors may be applied to chains of descriptors. In such a relationship
one descriptor leads to another to form an associative chain. There are many
non-equivalent ways of formulating the conditions for the existence of
such a chain. One is to let $\langle d_1, \ldots, d_n \rangle$ be an associative
chain of n-th order. Then this chain is defined as:

(a) The set $[d_1, \ldots, d_n]$ of descriptors comprised in the chain
contains each element except the first and the last only once.

(b) The first element appears twice; it is also the last element.
The first and last elements are linked to complete the chain.

(c) Each element except the first determines its successor by
selecting the second most related descriptor. The first
descriptor determines its successor by selecting its most
related neighbor.

Then, if every associative chain of n-th order in one taxonomic system
corresponds to a chain in another, a chain invariance of n-th order
exists. The elements in one chain correspond to the elements in the
other, but not necessarily in the same order.

There are a number of additional possible relationships that
remain invariant. The problem is to select those that realistically
relate to the properties of data structures and their associated
indexing systems.

If these invariances exist, formal rules for reassigning the
descriptors may be deduced. The concept of invariance places a strong
constraint upon the type of admissible rules that can be formulated.
There is also a relation between the invariances and the nature of the
convergence and efficiency criteria imposed upon the corrective procedures.
The important question is: Given a specific form of invariance and the
appropriate rules for complementing sets and for reassigning descriptors,
how many queries must elapse before the external taxonomy approximates
the ideal? (Approximation in this sense may be measured by the probability
of obtaining a set that is too small or too large by a specified margin.)


A comparison between one type of invariance and another now
becomes possible. Those invariances that result in a quick convergence
of the corrective procedures are desirable. Conversely, it is possible
to investigate the suitability of rules for complementing and reassigning
descriptors by keeping a set of invariant relationships constant. All
these problems can be investigated mathematically.

4.2.2.5  Summary - There is an inherent problem in accommodating
the descriptors selected for a set of documents by indexing rules to the
descriptors used by the user of a system. This problem is related to
the extensional difference in the denotation of descriptors or words in
an external and an ideal taxonomy. This discussion described methods
for developing corrective procedures, which could be applied automatically,
to relate the external to the ideal taxonomy. The basis for developing
the inferential rules for these procedures is the concept of invariance.

4.3  ASSIGNMENT OF DESCRIPTORS TO DOCUMENTS

The major work performed on the assignment of descriptions to documents
has been on the development of automatic indexing methods based on clue
words. The approach assumes that an initial set of categories has been
set up by a group of human experts and that there is available a test
body of documents that can be used to extract the basic parameters used
in automatic indexing. The basic thesis of this approach is that the
occurrence of certain words in a document indicates the correct
categorization of that document; i.e., the descriptor most appropriate to
it. A variation on this approach is the use of game-theoretic methods to
find those clue words that maximize the probability of correct
classification; using this approach, the choice of clue words determines
the choice of a classification algorithm.


The description assigned to a document need not be composed of
descriptors in the usual sense. Automatic abstracting provides a
technique for generating more logically complex document descriptions; the
descriptive language in this case has all the richness of human language.
Some investigations in this area are described below.


4.3.1  Information Theoretical Methods of Document Categorization -
This section presents some applications of information theory to the
problem of document classification or categorization. Criteria for a
good categorizer are presented, and various information theoretical
measures that measure the goodness of categorizers are examined.


The problem of document categorization is the problem of selecting
from a set of possible categories those categories to which a document
may belong. This selection would have to be based upon certain clues
or indications found in the document itself. Thus, as Maron [41] has
stated, the problem of categorization can be divided into two parts:
the selection of certain relevant aspects of a document as clues toward
classification; and the use of these clues to predict the proper
category to which the document belongs. Once the method of classification
has been defined, then the procedures could be automated.


Many authors [3, 4, 8, 16, 41, 67] have felt that the occurrence
of certain words in a document provided excellent indications of the
category to which that document belonged. Based upon word occurrence
statistics, document categories would be predicted automatically. This
approach is also developed here, but certain information theoretical
techniques are applied that do not appear to have been applied elsewhere.

This approach assumes that a group of human experts will initially
classify a number of documents into a given set of categories. A basic
assumption is that all categories that receive one or more documents will
be retained as permanent categories, which will be the only categories
used in the future. Another assumption is that the number of documents
initially classified by experts is large enough so that the statistics
of this group may be assumed to reflect the statistics of the body of
documents that may later be automatically categorized. In other words,
relative frequencies of categorization obtained from the initial group
will be used as the probabilities of categorization of the larger group.

4.3.1.1  Basic Approach to Automatic Classification Using Word
Occurrence Information

4.3.1.1.1  Criteria for Selecting Predictors - It is expected
that the occurrence of certain words in a document indicates the
categorization of that document. It follows that one of the criteria for
selecting a particular word to predict categories is that its occurrence
in documents be strongly correlated with the appearance of those
documents in a particular category, for those documents that were initially
classified. In other words, a word that appears in every document of a
particular category and appears in no document of any other category
seems to be an ideal predictor of that category. In practice there may
be few of these ideal predictors; that is, it is necessary to look for
words whose occurrence in a document means that a particular category
for that document is much more likely than any other category.

This criterion would be sufficient for choosing indicator words
if the distribution of documents in the categories were uniform. In
practice, this condition would generally not be the case; some categories
would have many more documents than others. Then a word that would seem
to be an excellent indicator might be found to supply no more information
than the total distribution of documents supplied. Thus the occurrence
of the good indicator word in documents must not only be strongly
correlated with the classification of these documents in one particular
category, but the distribution of documents containing this word must
also markedly differ from the distribution of all the documents.

4.3.1.1.2  Mathematical Statement of the Problem - The problem
can now be expressed mathematically: Given N documents classified into
k categories $C_j$, where j = 1, ..., k. The vocabulary of the N documents
contains m words, $W_i$, i = 1, ..., m. Word $W_i$ occurs in $N_i$ documents,
and $n_{ij}$ of these documents fall into category $C_j$.

Let:

$p(C_j)$ = the probability that a document falls into category $C_j$

$p(C_j|W_i)$ = the probability that a document with the word $W_i$
falls into category $C_j$

Then:

$p(C_j) = p_j = n_j/N$*   (9)

and:

$p(C_j|W_i) = p_{ij} = n_{ij}/N_i$   (10)

where $n_j$ is the number of documents in category $C_j$. The following
relationships hold by definition:

$\sum_j n_{ij} = N_i$,   $\sum_j n_j = N$   (11)

*The classification of a document into two or more categories is counted
as the classification into one category each of two or more documents.
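
The relative frequencies in (9) and (10) can be tallied from an expert-classified sample as sketched below. The sketch assumes each sample entry is a (category, set of words) pair with exactly one category, which matches the footnote's convention of counting a multiply classified document once per category; the function name and the sample are hypothetical.

```python
from collections import Counter

def estimate_probabilities(documents):
    """documents: list of (category, set_of_words) pairs, one category each."""
    total = len(documents)                          # N
    n_j = Counter(cat for cat, _ in documents)      # documents per category
    N_i = Counter()                                 # documents containing W_i
    n_ij = Counter()                                # ... that fall in C_j
    for cat, words in documents:
        for w in words:
            N_i[w] += 1
            n_ij[(w, cat)] += 1
    p_j = {cat: n / total for cat, n in n_j.items()}
    p_ij = {(w, cat): n / N_i[w] for (w, cat), n in n_ij.items()}
    return p_j, p_ij

# Toy sample (assumed): four classified documents.
sample = [
    ("radar", {"antenna", "pulse"}),
    ("radar", {"pulse"}),
    ("optics", {"lens"}),
    ("optics", {"lens", "pulse"}),
]
p_j, p_ij = estimate_probabilities(sample)
```

In this sample "lens" occurs only in the optics category, so it is an ideal predictor in the sense of the preceding section, while "pulse" is split two to one between the categories.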


It has been assumed that there exists at least one document in
each category; i.e., the smallest possible $p_j = 1/N$. If there were no
documents in a category $C_j$, then $p_j$ would be zero; consequently all the
$p_{ij}$'s would be zero. Such a category would be of no use and would be
discarded. Having at least one document in each category also implies that
$k \le N$, and that the largest possible $p_j = 1 - \frac{k-1}{N}$, for there
are k - 1 categories that would have to have the minimum $p_j$. Therefore:

$\frac{1}{N} \le p_j \le 1 - \frac{k-1}{N}$

and:

$0 \le p_{ij} \le 1$   (12)


4.3.1.1.3  Definitions of Measures of Goodness - The
noncorrelation of word occurrence and category, or the uncertainty of
category given the occurrence of a word $W_i$, can be expressed by Shannon's
formula for entropy:

$H_i = H(C_j|W_i) = -\sum_j p_{ij} \log p_{ij}$   (13)

Thus a good indicator word would have a low $H_i$. But is this word
supplying more information than the total document distribution? Maron
suggests a measure:

$M_1 = H - H_i$   (14)

where:

$H = H(C_j) = -\sum_j p_j \log p_j$   (15)

H is simply the uncertainty of categorization when no word occurrences
are known; that is, H is the entropy of the a priori distribution of
all of the documents.


This measure, however, does not seem adequate. Difficulty
arises when the a priori $p_j$ are unequal and have the same numerical
values as the $p_{ij}$ of different categories. In this case, $H = H_i$ and
$M_1 = 0$, which indicates a poor predictor; but $W_i$ may actually be a
good predictor in terms of the given criteria. The example in Figure 2
illustrates this difficulty. Clearly $H = H_r$ and $M_1 = 0$ in Figure 2, but
$W_r$ is a good predictor and supplies a great deal of information.

■i 
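The difficulty can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-category case in which the distribution for documents containing the word is the a priori distribution with its values interchanged; the numbers and the function names `entropy` and `relative_information` are illustrative assumptions, not data from this report.

```python
import math

def entropy(dist):
    """Shannon entropy -sum p log p (natural log); zero terms contribute 0."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def relative_information(cond, prior):
    """sum_j p_ij * log(p_ij / p_j): information the word adds over the prior."""
    return sum(q * math.log(q / p) for q, p in zip(cond, prior) if q > 0)

# Hypothetical two-category example (illustrative values only).
prior = [0.7, 0.3]   # a priori p_j
cond = [0.3, 0.7]    # p_ij for documents containing word W_i

H = entropy(prior)
H_i = entropy(cond)
M_1 = H - H_i                               # Maron's measure
info = relative_information(cond, prior)    # accounts for the prior directly

print(f"H = {H:.4f}  H_i = {H_i:.4f}  M_1 = {M_1:.4f}  info = {info:.4f}")
```

Because the two distributions contain the same values in a different order, their entropies coincide and $M_1$ vanishes, while a quantity that compares each $p_{ij}$ with its own $p_j$ remains clearly positive.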

More effective measures of the adequacy of an indicator word
can be based on a relative entropy function of the type found in
Watanabe [81+]. This function is similar to the previous entropy
functions, but it accounts for the a priori probabilities directly. The
relative entropy, $S_i$, is defined by:

$$S_i = -\sum_j p_{ij} \log \frac{p_{ij}}{A\,p_j} \tag{16}$$

where $A$ is a positive constant chosen to keep $S_i$ non-negative. $A$ should
be chosen such that $A = 1/p_0$, where $p_0 \le p_j$ for all $j$, so that
$S_{i\,\min} = 0$. This condition means that $k \le A \le N$, since
$1/N \le p_0 \le 1/k$.

[Figure 2: plot of two probability distributions over the categories. Legend:
solid line, a priori distribution; dashed line, distribution of documents
containing word $W_i$.]

FIGURE 2. Probability Distributions for a Class of Documents


Before these measures are defined and examined, one more entropy
function must be defined:

$$H_A = -\sum_j p_j \log \frac{p_j}{A} = H + \log A \tag{17}$$


Three possible measures will now be defined, in addition to the measure
that Maron has suggested.

$$\begin{aligned}
M_1 &= H - H_i \qquad \text{(Maron's measure)} \\
M_2 &= H - S_i \\
M_3 &= H_A - S_i \\
M_4 &= \log A - S_i
\end{aligned} \tag{18}$$

Now:

$$\begin{aligned}
M_2 &= H - H_i - \sum_j p_{ij} \log p_j - \log A \\
M_3 &= H - H_i - \sum_j p_{ij} \log p_j = M_2 + \log A \\
M_4 &= \sum_j p_{ij} \log \frac{p_{ij}}{p_j} = -H_i - \sum_j p_{ij} \log p_j
\end{aligned} \tag{19}$$

The new measures $M_2$ and $M_3$ are similar to $M_1$ except for a cross-term
that relates the $p_j$ and the $p_{ij}$. $M_4$ also has this cross-term; $M_4$
is simply $M_3$ with the constant term missing. The behavior of these
measures of goodness and of the various entropy functions is developed in
Appendix A, Section 9.1.


4.3.1.1.4 Evaluation of the Measures - Measure $M_1$ was shown to
be inadequate, since it may erroneously indicate that a good predictor
is a bad predictor. In addition, $M_1$ can assume negative values. $M_2$ can
also assume negative values, which may make it inconvenient to use. $M_2$
is also inconvenient to calculate, since it requires the calculation of
two sums, $\sum_j p_j \log p_j$ and $\sum_j p_{ij} \log \frac{p_{ij}}{p_j}$, and
since the last summation also includes a division operation. $M_3$ requires
the calculation of these same sums, although it is slightly more convenient
to use since it is always positive. $M_1$, $M_2$, and $M_3$ have fairly complex
expressions for maxima and minima; $M_1$ and $M_2$ become negative and $M_3$
never reaches zero.

$M_4$, on the other hand, is never negative, has a simple expression for
the maximum, has a zero minimum, and is easier to calculate than the others.


An additional argument in favor of $M_4$ is that it can be justified
on the basis of more fundamental definitions of information. This
justification can proceed in either of two ways: on the basis of
probabilities or on the basis of entropies. In either case, $M_4$ can be
shown to be the amount of information provided by the occurrence of word
$W_i$ in a document. The proofs are given in Appendix B, Section 9.2.

It seems clear, then, that $M_4$ is the best measure of the group,
both in terms of ease of calculation and in terms of theoretical
justification. For these reasons, it will be adopted as one of the two
basic measures for category prediction; since there is a different $M_4$
for each word, the notation $M_i$ will be used instead of $M_4$ to indicate
the dependence of the measure on the particular word being considered.


4.3.1.1.5 Mathematical Expression of Predictor Criteria - The
correlation of the occurrence of an indicator word in a document and the
classification of that document in a particular category would be
measured by $H_i$.

$$H_i = -\sum_j p_{ij} \log p_{ij} \qquad (0 \le H_i \le \log k) \tag{20}$$

A low $H_i$ indicates a good predictor; a high $H_i$, a bad predictor.

A measure that also accounts for the a priori distribution of
documents and indicates how much more information the predictor supplies
than this distribution is $M_i$.

$$M_i = \sum_j p_{ij} \log \frac{p_{ij}}{p_j} \qquad (0 \le M_i \le -\log p_0), \quad (1/N \le p_0 \le 1/k) \tag{21}$$

A high $M_i$ indicates a good predictor; a low $M_i$, a bad one. Both of
these measures must be taken into account when choosing indicator words.

4.3.1.1.6 Predictors - On the basis of these mathematical
criteria, it is now possible to select clues or predictors. A word that
has a high value for $M_i$ and a low value for $H_i$ will be selected. The
cutoff point for these functions for good predictors must be determined
experimentally. It is difficult to say how high a value for $M_i$ or how
low a value for $H_i$ is actually needed for a good predictor without
empirical verification.
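The selection step can be sketched in a few lines. The example below is only an illustration under stated assumptions: the category sizes, the words, and their counts are hypothetical, the function name `word_measures` is invented here, and natural logarithms are used. It estimates $p_{ij} = n_{ij}/N_i$ and $p_j = n_j/N$ from document counts and ranks the words by $M_i$.

```python
import math

def word_measures(n_ij, n_j):
    """Return (H_i, M_i) for one word.

    n_ij[j]: number of documents in category j that contain the word.
    n_j[j]:  number of documents in category j (each assumed >= 1).
    """
    N_i = sum(n_ij)
    N = sum(n_j)
    H_i = 0.0
    M_i = 0.0
    for nij, nj in zip(n_ij, n_j):
        if nij == 0:
            continue  # a zero probability contributes nothing to either sum
        p_ij = nij / N_i
        p_j = nj / N
        H_i -= p_ij * math.log(p_ij)
        M_i += p_ij * math.log(p_ij / p_j)
    return H_i, M_i

# Hypothetical collection: three categories holding 50, 30 and 20 documents.
n_j = [50, 30, 20]
counts = {
    "radar": [1, 2, 18],
    "system": [25, 15, 10],   # distributed exactly like the prior
    "antenna": [0, 12, 3],
}

measures = {w: word_measures(c, n_j) for w, c in counts.items()}
ranked = sorted(measures, key=lambda w: measures[w][1], reverse=True)
for w in ranked:
    H_i, M_i = measures[w]
    print(f"{w:8s} H_i = {H_i:.3f}  M_i = {M_i:.3f}")
```

A word whose documents are distributed exactly like the whole collection ("system" here) receives $M_i = 0$ and falls to the bottom of the ranking; where to place the cutoff remains an empirical question, as noted above.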

Not only can single words be used as predictors, but word pairs,
word triplets, and higher word combinations can also be used with an
expected improvement in prediction. The mathematics for these cases is
essentially the same; the only difference is that the occurrence of
word pair $[W_a W_b]$ or word triplet $[W_a W_b W_c]$ is considered instead of
the single word $W_i$. These word pairs and word triplets can be ranked
together with single words on the same scale, and their effectiveness
as predictors can then be compared.

4.3.1.1.7 Application of Clues to Predicting Categories - Once
the significant predictors have been determined, it is possible to obtain
the probability that a document appears in a category on the basis of
those predictors. This probability is:

$$P(C_j \mid W_a W_b \ldots) \tag{22}$$

Maron gives an approximation to this probability. In general,
this approximation would require a great deal of calculation. One way
of approximating the probability would be to take the weighted average
of the category probabilities using each of the most significant indicator
words. Other functions of these words might also approximate the
probability. Thus, in general, the predicted category would be some function
of the category probabilities for each of the words. Methods for
determining suitable functions of this kind should be investigated.
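One such function, the weighted average, might be sketched as follows. This is an illustration only: the text does not specify the weights, so the choice of weighting each word by a number such as its $M_i$ is an assumption, as are the function name `weighted_category_estimate` and all of the sample values.

```python
def weighted_category_estimate(p_by_word, weights):
    """Weighted average of per-word category distributions.

    p_by_word: dict word -> list of p_ij over the categories j.
    weights:   dict word -> non-negative weight (assumed here, e.g. the
               word's M_i; the report leaves the weighting open).
    """
    total = sum(weights[w] for w in p_by_word)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    k = len(next(iter(p_by_word.values())))
    estimate = [0.0] * k
    for w, dist in p_by_word.items():
        for j, p in enumerate(dist):
            estimate[j] += weights[w] * p / total
    return estimate

# Hypothetical clue words found in one document, with their p_ij.
found = {"radar": [0.05, 0.10, 0.85], "antenna": [0.00, 0.80, 0.20]}
weights = {"radar": 1.0, "antenna": 0.8}

estimate = weighted_category_estimate(found, weights)
predicted = max(range(len(estimate)), key=estimate.__getitem__)
print(estimate, "-> predicted category index", predicted)
```

Because each input distribution sums to one and the weights are normalized, the estimate is itself a probability distribution over the categories, so the predicted category is simply its largest entry.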


4.3.1.1.8 Modification of Categories - Implied in this discussion
are criteria for modifying and combining categories to get better
classification. What is needed is a set of categories that would be strongly
correlated with word occurrence and that would yield approximately equal
a priori category probabilities. In this way, there would be words with
high $M_i$ and low $H_i$. In fact, these two measures would then be almost
the same; for if $p_j = 1/k$ for all $j$, then:

$$M_i = \sum_j p_{ij} \log p_{ij} + \log k = \log k - H_i \tag{23}$$

Thus in equalizing the categories, if for some $W_i$, $M_i$ is high and there
exists at least one such $W_i$ for each category, then the classification
would be a good one.


4.3.1.2 Extension of Concepts to Include Word Frequency Information -
There are several ways in which word frequency information can
be taken into account to determine good predictors of document categories.
The first two methods use absolute values of word occurrence in a document,
while the third method uses relative word frequency in a document to
obtain more information.


4.3.1.2.1 Additional Definition - Let:

$N$ = the total number of documents in the initial group.

$N_i$ = the number of documents in which word $W_i$ occurs.

$N_i(x)$ = the number of documents in which word $W_i$ occurs $x$ times.

$n_j$ = the number of documents in category $C_j$.

$n_{ij}$ = the number of documents in category $C_j$ which have word $W_i$.

$n_{ij}(x)$ = the number of documents in category $C_j$ which have word $W_i$ $x$ times.

Now:

$$N_i = \sum_x N_i(x), \qquad n_{ij} = \sum_x n_{ij}(x) \tag{24}$$


In addition to the probabilities $p_{ij}$ and $p_j$, the following
probabilities can be defined. Let:

$p_i$ = the probability that a document contains word $W_i$.

$p_i(x)$ = the probability that a document contains word $W_i$ $x$ times.

$p_{ij}(x)$ = the probability that a document containing word $W_i$ $x$ times falls into category $C_j$.

$p(C_j, W_i)$ = the joint probability that a document is in category $C_j$ and contains word $W_i$.

$p[C_j, W_i(x)]$ = the joint probability that a document is in category $C_j$ and contains word $W_i$ $x$ times.

Then the probabilities can be approximated as follows:

$$\begin{aligned}
p_i &= \frac{N_i}{N} \\
p_i(x) &= \frac{N_i(x)}{N} \\
p_{ij}(x) &= \frac{n_{ij}(x)}{N_i(x)} \\
p(C_j, W_i) &= \frac{n_{ij}}{N} \\
p[C_j, W_i(x)] &= \frac{n_{ij}(x)}{N}
\end{aligned} \tag{25}$$

Of course:

$$p_i = \sum_x p_i(x), \qquad p(C_j, W_i) = \sum_x p[C_j, W_i(x)] \tag{26}$$

and $p_{ij}(x)$ is related to $p_{ij}$ by the expression:

$$p_{ij} = \frac{n_{ij}}{N_i} = \frac{\sum_x p_{ij}(x)\, N_i(x)}{\sum_x N_i(x)} \tag{27}$$


4.3.1.2.2 Derivation of Measures

(a) Method 1 - The measures $H_i$ and $M_i$ can easily be generalized to
include frequency information by considering word $W_i$ occurring
$x$ and only $x$ times in a document as a clue. Then, instead of
using $p_{ij}$ in $H_i$ and $M_i$, a new probability $p_{ij}(x)$ can be used.
Two new measures, $H_i(x)$ and $M_i(x)$, can now be defined:

$$\begin{aligned}
H_i(x) &= -\sum_j p_{ij}(x) \log p_{ij}(x) \\
M_i(x) &= \sum_j p_{ij}(x) \log \frac{p_{ij}(x)}{p_j}
\end{aligned} \tag{28}$$

With these measures, the effectiveness of word $W_i$ as a predictor,
when it occurs $x$ times in a document, can be evaluated. As
before, $H_i(x)$ must be low and $M_i(x)$ must be high for a good
predictor.

The average effectiveness of a word as a predictor can be
measured by:

$$\bar{H}_i(x) = \langle H_i(x) \rangle_x, \qquad \bar{M}_i(x) = \langle M_i(x) \rangle_x \tag{29}$$

where $\langle f(a) \rangle_a$ denotes the probabilistically weighted average
value of the function $f$ over its domain. Then, on the basis
of Equations (25) and (26), it follows that:

$$\bar{H}_i(x) = \frac{\sum_x p_i(x)\, H_i(x)}{\sum_x p_i(x)} \tag{30}$$

and:

$$\bar{H}_i(x) = -\frac{1}{p_i} \sum_x \sum_j p[C_j, W_i(x)] \log p_{ij}(x) \tag{31}$$

Similarly:

$$\bar{M}_i(x) = \frac{\sum_x p_i(x)\, M_i(x)}{\sum_x p_i(x)} \tag{32}$$

But:

$$M_i(x) + H_i(x) = -\sum_j p_{ij}(x) \log p_j \tag{33}$$

therefore:

$$\langle M_i(x) + H_i(x) \rangle_x = \bar{M}_i(x) + \bar{H}_i(x) = -\sum_j \sum_x \frac{p[C_j, W_i(x)]}{p_i} \log p_j \tag{34}$$

and, by substituting Equation (25):

$$\bar{M}_i(x) + \bar{H}_i(x) = -\sum_j \frac{n_{ij}}{N_i} \log p_j \tag{35}$$

But:

$$M_i + H_i = -\sum_j p_{ij} \log p_j \tag{36}$$

therefore:

$$\bar{M}_i(x) + \bar{H}_i(x) = M_i + H_i \tag{37}$$
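The conservation property of Method 1 can be checked numerically. The sketch below assumes a small hypothetical table of counts $n_{ij}(x)$ (documents in category $j$ containing the word exactly $x$ times); the counts and the helper name `H_and_M` are illustrative assumptions. It computes the occurrence-only measures and the frequency-averaged measures and compares the two sums.

```python
import math

def H_and_M(dist, prior):
    """Entropy of dist and its relative information with respect to prior."""
    H = -sum(q * math.log(q) for q in dist if q > 0)
    M = sum(q * math.log(q / p) for q, p in zip(dist, prior) if q > 0)
    return H, M

# Hypothetical counts: n_ij_x[x][j] = documents in category j in which the
# word occurs exactly x times; n_j[j] = documents in category j.
n_j = [40, 35, 25]
n_ij_x = {1: [6, 2, 1], 2: [3, 4, 1], 3: [1, 5, 7]}

N = sum(n_j)
prior = [n / N for n in n_j]                          # p_j
N_i_x = {x: sum(row) for x, row in n_ij_x.items()}    # N_i(x)
N_i = sum(N_i_x.values())                             # N_i = sum_x N_i(x)

# Occurrence-only measures from p_ij = n_ij / N_i.
n_ij = [sum(row[j] for row in n_ij_x.values()) for j in range(len(n_j))]
H_i, M_i = H_and_M([n / N_i for n in n_ij], prior)

# Averages over x, weighted by p_i(x) / p_i = N_i(x) / N_i.
H_bar = 0.0
M_bar = 0.0
for x, row in n_ij_x.items():
    H_x, M_x = H_and_M([n / N_i_x[x] for n in row], prior)
    H_bar += N_i_x[x] / N_i * H_x
    M_bar += N_i_x[x] / N_i * M_x

print(f"H_i + M_i     = {H_i + M_i:.6f}")
print(f"H_bar + M_bar = {H_bar + M_bar:.6f}")
```

In this sketch the two sums agree to rounding error, while the averaged entropy is no larger than $H_i$ and the averaged information is no smaller than $M_i$: frequency information moves uncertainty into information without changing their total.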


(b) Method 2 - This method is similar to Method 1. Instead of
considering that a word occurs exactly $x$ times in a document, this
method considers that a word occurs between $x_a$ and $x_b$ times in
a document. In other words, word frequency information is
grouped in intervals of frequency of occurrence, $B_r$. For example,
the frequency intervals might be 1-5 times, 6-10 times, etc.

New probabilities must be introduced. Let:

$p_i(B_r)$ = the probability that a document contains word $W_i$ $x$ times, where $x$ is in interval $B_r$.

$p_{ij}(B_r)$ = the probability that a document containing word $W_i$ $x$ times falls into category $C_j$, where $x$ is in interval $B_r$.

$p[C_j, W_i(B_r)]$ = the joint probability that a document is in category $C_j$ and contains word $W_i$ $x$ times, where $x$ is in interval $B_r$.

Now the probabilities can be expressed as:

$$\begin{aligned}
p_i(B_r) &= \sum_{x \in B_r} p_i(x) \\
p[C_j, W_i(B_r)] &= \sum_{x \in B_r} p[C_j, W_i(x)] \\
p_{ij}(B_r) &= \frac{\sum_{x \in B_r} p_{ij}(x)\, N_i(x)}{\sum_{x \in B_r} N_i(x)}
            = \frac{\sum_{x \in B_r} p[C_j, W_i(x)]}{\sum_{x \in B_r} p_i(x)}
\end{aligned} \tag{38}$$

Then, following Method 1 and Equation (28), expressions may be
written for $H_i(B_r)$ and $M_i(B_r)$:

$$\begin{aligned}
H_i(B_r) &= -\sum_j p_{ij}(B_r) \log p_{ij}(B_r) \\
M_i(B_r) &= \sum_j p_{ij}(B_r) \log \frac{p_{ij}(B_r)}{p_j}
\end{aligned} \tag{39}$$

$H_i(B_r)$ should be low and $M_i(B_r)$ should be high for a good
predictor.

Another set of functions that measure the effectiveness of word
$W_i$ as a predictor, when $W_i$ occurs $x$ times and $x$ is in interval
$B_r$, can be obtained by taking the average values of $H_i(x)$ and
$M_i(x)$ over the interval $B_r$. The average effectiveness is
measured by:

$$\bar{H}_i(x,r) = \langle H_i(x) \rangle_{x \in B_r}, \qquad \bar{M}_i(x,r) = \langle M_i(x) \rangle_{x \in B_r} \tag{40}$$

Then, by using Equation (33) as in Method 1:

$$\begin{aligned}
\langle H_i(x) + M_i(x) \rangle_{x \in B_r}
 &= -\sum_j \frac{\sum_{x \in B_r} p_i(x)\, p_{ij}(x)}{\sum_{x \in B_r} p_i(x)} \log p_j \\
 &= -\sum_j \frac{p[C_j, W_i(B_r)]}{p_i(B_r)} \log p_j \\
 &= -\sum_j p_{ij}(B_r) \log p_j
\end{aligned} \tag{41}$$

But:

$$H_i(B_r) + M_i(B_r) = -\sum_j p_{ij}(B_r) \log p_j \tag{42}$$

therefore:

$$H_i(B_r) + M_i(B_r) = \langle H_i(x) + M_i(x) \rangle_{x \in B_r} = \bar{H}_i(x,r) + \bar{M}_i(x,r) \tag{43}$$

If this quantity $[H_i(B_r) + M_i(B_r)]$ is averaged over all $r$, then
by the proof outlined for Method 1:

$$\begin{aligned}
H_i + M_i &= \bar{H}_i(x) + \bar{M}_i(x) \\
          &= \langle H_i(B_r) \rangle_r + \langle M_i(B_r) \rangle_r \\
          &= \langle \bar{H}_i(x,r) \rangle_r + \langle \bar{M}_i(x,r) \rangle_r
\end{aligned} \tag{44}$$

Thus the sum of the averages of the two measures remains constant
and is independent of the size of the intervals of frequency of
occurrence.

(c) Method 3 - This method considers the number of times a word
appears in a document in relation to the total number of words
in a document as a clue. Using this relative frequency
information as clues should provide even better category prediction
than word occurrence or simple word frequency information.

Let $f$ be the relative frequency of a word in a document; the
relative frequency is the ratio of the number of occurrences of
the word in the document to the total number of words in the
document. Let $f_s$ be an interval of relative frequencies, where
the interval is defined by the limits $f_a$ and $f_b$. Then $p_i(f_s)$
is simply the probability of word $W_i$ occurring in a document
with a relative frequency in the interval $f_s$, and $p_{ij}(f_s)$ is
the probability that a document falls in category $C_j$, given
that the document contains word $W_i$ with a relative frequency
within the interval $f_s$.

The probabilities $p_i(f_s)$ and $p_{ij}(f_s)$ are approximated by:

$$p_i(f_s) = \frac{N_i(f_s)}{N}, \qquad p_{ij}(f_s) = \frac{n_{ij}(f_s)}{N_i(f_s)} \tag{45}$$

where $N_i(f_s)$ is the number of documents containing word $W_i$ with
a relative frequency within the interval $f_s$, and $n_{ij}(f_s)$ is the
number of documents in category $C_j$ containing word $W_i$ with a
relative frequency within the interval $f_s$.

Following the previous analyses, expressions for $H_i(f_s)$ and
$M_i(f_s)$ can be written:

$$\begin{aligned}
H_i(f_s) &= -\sum_j p_{ij}(f_s) \log p_{ij}(f_s) \\
M_i(f_s) &= \sum_j p_{ij}(f_s) \log \frac{p_{ij}(f_s)}{p_j}
\end{aligned} \tag{46}$$

By analogy to the proofs developed for Methods 1 and 2, $\bar{H}_i(f_s)$
and $\bar{M}_i(f_s)$ can be calculated, where:

$$\bar{H}_i(f_s) = \langle H_i(f_s) \rangle_s, \qquad \bar{M}_i(f_s) = \langle M_i(f_s) \rangle_s \tag{47}$$

Since, as compared to Equation (33):

$$M_i(f_s) + H_i(f_s) = -\sum_j p_{ij}(f_s) \log p_j \tag{48}$$

then:

$$\langle M_i(f_s) + H_i(f_s) \rangle_s = \bar{M}_i(f_s) + \bar{H}_i(f_s) = -\sum_j p_{ij} \log p_j \tag{49}$$

Therefore, as before:

$$\bar{M}_i(f_s) + \bar{H}_i(f_s) = M_i + H_i \tag{50}$$

One of the major experimental problems is the proper selection
of frequency intervals to evaluate. For some areas of the relative
frequency spectrum a small change in interval size might
lead to a large change in effectiveness; for other areas of
the spectrum, however, changing the interval might have a
negligible effect on effectiveness. These intervals will in general
not be uniform over the spectrum and will be different for
each word. Although this selection and evaluation appears
difficult, it will lead to better category prediction.


4.3.1.2.3 Improvement in Effectiveness - We have previously
defined and used measures to indicate the improvement in effectiveness
of prediction using word or word frequency information rather than simple
category statistics alone. Instead of evaluating word frequency
information with respect to simple category statistics, this information may
be evaluated with respect to word occurrence information. This new
measure would indicate how much more information the word frequency
information supplied than the word occurrence information. Call this measure
$M_{ia}(x)$, where:

$$M_{ia}(x) = \sum_j p_{ij}(x) \log \frac{p_{ij}(x)}{p_{ij}} \tag{51}$$

Now:

$$M_{ia}(x) = -\sum_j p_{ij}(x) \log p_{ij} - H_i(x) \tag{52}$$

but this relationship does not seem very meaningful. The important
relationships would relate $M_{ia}(x)$ with $M_i(x)$ and $M_i$. Consider the
quantity $M_i(x) - M_{ia}(x)$. Now:

$$\begin{aligned}
M_i(x) - M_{ia}(x) &= \sum_j p_{ij}(x) \log \frac{p_{ij}(x)}{p_j} - \sum_j p_{ij}(x) \log \frac{p_{ij}(x)}{p_{ij}} \\
                   &= \sum_j p_{ij}(x) \log \frac{p_{ij}}{p_j}
\end{aligned} \tag{53}$$

Let us now take the average of this quantity over $x$. Then:

$$\langle M_i(x) - M_{ia}(x) \rangle_x = \frac{\sum_x p_i(x) \sum_j p_{ij}(x) \log \frac{p_{ij}}{p_j}}{\sum_x p_i(x)} \tag{54}$$

where

$p_i(x)$ = the probability that a document contains word $W_i$ $x$ times.

Now:

$$\sum_x p_i(x) = p_i \quad \text{and} \quad \sum_x p_i(x)\, p_{ij}(x) = p(C_j, W_i) \tag{55}$$

where

$p_i$ = the probability that a document contains word $W_i$, and

$p(C_j, W_i)$ = the joint probability that a document is in category $C_j$ and contains word $W_i$.

Thus:

$$\langle M_i(x) - M_{ia}(x) \rangle_x = \sum_j \frac{p(C_j, W_i)}{p_i} \log \frac{p_{ij}}{p_j} \tag{56}$$

But:

$$\frac{p(C_j, W_i)}{p_i} = p_{ij} \tag{57}$$

Then:

$$\langle M_i(x) - M_{ia}(x) \rangle_x = \sum_j p_{ij} \log \frac{p_{ij}}{p_j} = M_i \tag{58}$$

Therefore:

$$\langle M_i(x) - M_{ia}(x) \rangle_x = \bar{M}_i(x) - \bar{M}_{ia}(x) = M_i \tag{59}$$


Now $M_i$ is a measure of the information that $W_i$ supplies about
the categorization. In addition:

$$\bar{M}_i(x) = \left\langle \log \frac{p_{ij}(x)}{p_j} \right\rangle_{x,j} = \left\langle \log \frac{p(C_j, W_i(x))}{p_j\, p_i(x)} \right\rangle_{x,j} \tag{60}$$

where

$p(C_j, W_i(x))$ = the joint probability that a document is in category $C_j$ and contains word $W_i$ $x$ times.

$\bar{M}_i(x)$ closely resembles an information function, and is a
measure of the average information that $W_i$ occurring $x$ times supplies about
the categorization. Now:

$$\bar{M}_{ia}(x) = \left\langle \log \frac{p_{ij}(x)}{p_{ij}} \right\rangle_{x,j} = \left\langle \log \frac{p(C_j, W_i(x))}{p_i(x)\, p_{ij}} \right\rangle_{x,j} \tag{61}$$

$\bar{M}_{ia}(x)$ then represents the average information that $W_i$ occurring
$x$ times supplies about the categorization, knowing that the document
contains $W_i$ at least once; i.e., $\bar{M}_{ia}(x)$ represents the average
information that word frequency information supplies above the word
occurrence information. Thus the equation:

$$\bar{M}_i(x) = M_i + \bar{M}_{ia}(x) \tag{62}$$

can be expressed verbally as follows:

[Average information about $C_j$ supplied by word frequency information]
= [Information about $C_j$ supplied by word occurrence information]
+ [Average information about $C_j$ supplied by word frequency information when word occurrence is known]

This equation satisfies our intuitive notions about information and the
additivity of information. The equation justifies to some extent the
choice of the particular information measures $M_i$, $\bar{M}_i(x)$, and
$\bar{M}_{ia}(x)$.

In a similar manner, the equation for the relative frequency
case can be developed. Therefore:

$$\bar{M}_i(f_s) = M_i + \bar{M}_{ia}(f_s) \tag{63}$$

where $f_s$ indicates an interval of relative frequencies.
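The additivity relation between occurrence information and frequency information can also be verified on a small example. The sketch below is illustrative only: the counts are hypothetical, the helper name `kl` is an assumption, and natural logarithms are used. It computes $M_i$, the average $\bar{M}_i(x)$, and the average $\bar{M}_{ia}(x)$ directly from their definitions.

```python
import math

def kl(q, p):
    """sum_j q_j * log(q_j / p_j), skipping terms where q_j is zero."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)

# Hypothetical counts: n_ij_x[x][j] = documents in category j in which the
# word occurs exactly x times; n_j[j] = documents in category j.
n_j = [40, 35, 25]
n_ij_x = {1: [6, 2, 1], 2: [3, 4, 1], 3: [1, 5, 7]}

N = sum(n_j)
p_j = [n / N for n in n_j]
N_i = sum(sum(row) for row in n_ij_x.values())
p_ij = [sum(row[j] for row in n_ij_x.values()) / N_i for j in range(len(n_j))]

M_i = kl(p_ij, p_j)      # information from word occurrence alone

M_bar = 0.0              # average information from word frequency
M_ia_bar = 0.0           # average extra information beyond occurrence
for row in n_ij_x.values():
    N_i_x = sum(row)
    p_ij_x = [n / N_i_x for n in row]
    weight = N_i_x / N_i                 # p_i(x) / p_i
    M_bar += weight * kl(p_ij_x, p_j)
    M_ia_bar += weight * kl(p_ij_x, p_ij)

print(f"M_i + M_ia_bar = {M_i + M_ia_bar:.6f}")
print(f"M_bar          = {M_bar:.6f}")
```

Both printed values coincide up to rounding, and $\bar{M}_{ia}(x)$ is never negative, so frequency information can only add to what word occurrence already supplies.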

4.3.2 Game-Theoretic Aspects of Clue Word Selection - The motivation
for the work described in this section arose from a consideration of the
role of clue word selection in an operational document classification
system. In such a system, the problem is to optimize the probability of
correct classification; among the parameters that can be varied are the
number of clue words chosen, the particular algorithm used for employing
these words, and the particular words used. The principal constraints
are the cost of processing and the amount of information available about
the relationship of the clue words to categories. The information-theoretic
approach described in the previous reports presents a method
of ranking clue words relative to each other in terms of information-theoretic
measures. The game-theoretic approach yields somewhat more
specific advice on how to choose clue words, but the necessary data for
the choice seem to be far more obscure. The best one would hope for
would be a game-theoretic justification of the information-theoretic
measures, in which a maximum payoff would be achieved by maximizing (or
minimizing) a function whose arguments are information-theoretic measures.
In an attempt to achieve this goal we have analyzed a number of specific
cases, which are described below.

4.3.2.1 The Approach - Consider the classification problem to
be a two-person game in which nature is one of the players. Further
consider that the probability that nature is in a particular state in
this game is known. There are a number of acts that the player may
perform, and associated with each act, for each particular state of
nature, there is a certain utility. Let:

State $C_j$ = the document is in category $j$
Act $A_r$ = put the document into category $r$
Utility $u_{rj}$ = the utility of act $A_r$ when nature is in state $C_j$

Thus we have a $k \times k$ utility matrix of the following form:

$$\begin{array}{c|cccc}
       & C_1    & C_2    & \cdots & C_k    \\ \hline
A_1    & u_{11} & u_{12} & \cdots & u_{1k} \\
A_2    & u_{21} & u_{22} & \cdots & u_{2k} \\
\vdots & \vdots & \vdots &        & \vdots \\
A_k    & u_{k1} & u_{k2} & \cdots & u_{kk}
\end{array}$$

Let $p(C_j)$ be the probability that nature is in state $C_j$ (nature plays
a mixed strategy, playing $C_j$ with probability $p(C_j)$). Then the utility
of act $A_r$, $U(A_r)$, can be written:

$$U(A_r) = \sum_j u_{rj}\, p(C_j) \tag{64}$$

Utilities would be calculated for each act, and the act with the highest
utility performed.

Consider a further addition to our model. Before we choose an
act, we are permitted to perform an experiment $e$ which has outcomes $\theta_i$.
Consider also that we have determined statistically the probabilities of
the states of nature when the outcome $\theta_i$ has occurred. These probabilities,
written $p(C_j \mid \theta_i)$, might have to be determined using Bayes' rule and the
probabilities $p(\theta_i \mid C_j)$, which are generally more easily available (or
deducible). Then:

$$p(C_j \mid \theta_i) = \frac{p(\theta_i \mid C_j)\, p(C_j)}{\sum_n p(\theta_i \mid C_n)\, p(C_n)} \tag{65}$$

The utility of act $A_r$, given that the experiment $e$ has had an outcome
$\theta_i$, is now:

$$U(A_r \mid \theta_i) = \sum_j u_{rj}\, p(C_j \mid \theta_i) \tag{66}$$

These remarks may be related to the classification problem if
we consider the experiment $e$ to be the scanning of a document hunting for
a clue word. If the first clue word we find is $W_i$, then the outcome of
the experiment is $\theta_i$. In this case, the utility $u_{rj}$ is unity if $r = j$,
i.e., if the selected category $r$ is the same as the correct category $j$,
and 0 otherwise. In other words,

$$u_{rj} = \delta_{rj} \tag{67}$$

where $\delta$ is the Kronecker delta function. We then have

$$U(A_j \mid \theta_i) = p(C_j \mid \theta_i) \tag{68}$$

We want to pick the value of $U(A_j \mid \theta_i)$ that is maximum for
given $\theta_i$. This is:

$$\max_j U(A_j \mid \theta_i) = \max_j p(C_j \mid \theta_i) \tag{69}$$
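The decision rule of this subsection can be illustrated with a short sketch. Everything numerical here is a hypothetical assumption (three categories, one observed clue word), and the function names `posterior` and `expected_utilities` are invented for the illustration: the code applies Bayes' rule to obtain $p(C_j \mid \theta_i)$, evaluates the utility of each act, and picks the best act under the 0/1 utility $u_{rj} = \delta_{rj}$.

```python
def posterior(likelihood, prior):
    """Bayes' rule: p(C_j | theta) from p(theta | C_j) and p(C_j)."""
    joint = [l * p for l, p in zip(likelihood, prior)]
    total = sum(joint)
    if total == 0:
        raise ValueError("outcome has zero probability under the prior")
    return [v / total for v in joint]

def expected_utilities(utility, probs):
    """U(A_r) = sum_j u_rj * p(C_j), evaluated for every act r."""
    return [sum(u * p for u, p in zip(row, probs)) for row in utility]

# Hypothetical three-category example.
prior = [0.5, 0.3, 0.2]            # p(C_j)
likelihood = [0.02, 0.10, 0.60]    # p(theta_i | C_j): the clue word is found
k = len(prior)
delta = [[1 if r == j else 0 for j in range(k)] for r in range(k)]  # u_rj

# Without the experiment, the best act follows the a priori probabilities.
act_before = max(range(k), key=expected_utilities(delta, prior).__getitem__)

# With the experiment, the posterior replaces the prior.
post = posterior(likelihood, prior)
utilities = expected_utilities(delta, post)
act_after = max(range(k), key=utilities.__getitem__)

print("posterior:", [round(p, 4) for p in post])
print("best act before:", act_before, " best act after:", act_after)
```

With the Kronecker-delta utility the expected utility of each act equals the posterior probability of the matching category, so choosing the best act reduces to choosing the most probable category; a non-diagonal utility matrix could be passed to the same function to weight different misclassifications differently.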


4.3.2.2 Maximization of Correct Classification - Consider a
classification procedure based on two experiments $e_1$ and $e_2$; $e_1$ checks
to see if word $W_1$ is present in a document and $e_2$ checks to see if word
$W_2$ is present. This procedure is represented in Figure 3. For each of
the four possible outcomes we get a set of probabilities,
$\{p(C_j \mid W_1 W_2)\}$, $\{p(C_j \mid W_1 \bar{W}_2)\}$, $\{p(C_j \mid \bar{W}_1 W_2)\}$, and
$\{p(C_j \mid \bar{W}_1 \bar{W}_2)\}$, where $\bar{W}_i$ indicates the
absence of word $W_i$. In accordance with this procedure, we would choose
those categories which have the highest probability of each set.

[Figure 3: decision tree. Experiment $e_1$ asks "Is $W_1$ present?" and
experiment $e_2$ asks "Is $W_2$ present?". At each of the four outcomes an act
is chosen such that:
(1) $p(C_r \mid W_1 W_2) = \max_j p(C_j \mid W_1 W_2)$
(2) $p(C_s \mid W_1 \bar{W}_2) = \max_j p(C_j \mid W_1 \bar{W}_2)$
(3) $p(C_t \mid \bar{W}_1 W_2) = \max_j p(C_j \mid \bar{W}_1 W_2)$
(4) $p(C_v \mid \bar{W}_1 \bar{W}_2) = \max_j p(C_j \mid \bar{W}_1 \bar{W}_2)$]

FIGURE 3. A Decision Procedure for Category Selection

Let us now group the documents we are trying to classify on the
basis of these experiments. In group 1, where $W_1$ and $W_2$ are present,
there are on the average the fraction $p(W_1 W_2)$ of the total number of
documents. If, for every document in this group we perform $A_r$, we will
correctly classify the fraction $p(C_r \mid W_1 W_2)$ of this group. Thus the total
fraction of correctly classified documents is, on the average:

$$\begin{aligned}
G &= p(C_r \mid W_1 W_2)\, p(W_1 W_2) + p(C_s \mid W_1 \bar{W}_2)\, p(W_1 \bar{W}_2) \\
  &\quad + p(C_t \mid \bar{W}_1 W_2)\, p(\bar{W}_1 W_2) + p(C_v \mid \bar{W}_1 \bar{W}_2)\, p(\bar{W}_1 \bar{W}_2) \\
  &= p(C_r W_1 W_2) + p(C_s W_1 \bar{W}_2) + p(C_t \bar{W}_1 W_2) + p(C_v \bar{W}_1 \bar{W}_2)
\end{aligned} \tag{70}$$

But $A_r$, $A_s$, etc., are optimal and so the conditional probabilities are
the maxima. Then:

$$G_o = \max_j p(C_j W_1 W_2) + \max_j p(C_j W_1 \bar{W}_2) + \max_j p(C_j \bar{W}_1 W_2) + \max_j p(C_j \bar{W}_1 \bar{W}_2) \tag{71}$$

In general, if $\{\alpha_i\}$ represents the set of outcomes,

$$G_o = \sum_i \max_j p(C_j \alpha_i) \tag{72}$$

Then we want to choose our experiments such that the outcomes lead to
a maximum value of $G_o$.

Let the series of experiments $e_1, e_2, \ldots, e_n$ associated with clue
words $W_1, W_2, \ldots, W_n$ be called the experiment $\alpha$. Let the possible outcomes
of this experiment be designated $\{\alpha_i\}$ as before. If we take different
combinations of words, we will generate an associated set of experiments
$\{\alpha\}$. Let $\gamma$ be all possible experiments we can generate in this manner.
Then:

$$G_{o\,\max} = \max_{\alpha \in \gamma} \sum_i \max_j p(C_j \alpha_i) \tag{73}$$

In general it would be very difficult to find the best set of
words, for every set would have to be examined. This procedure is
clearly different from the information-theoretical methods and in general
may lead to somewhat different results. What the differences are and
why they occur should be investigated, but it might be possible to derive
the information-theoretic formulae from the game-theoretic approach if
appropriate utility values are used.
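The exhaustive search described above can be sketched directly. The following is a minimal illustration under explicit assumptions: the six classified documents, the vocabulary, and the function names `correct_fraction` and `best_word_set` are all hypothetical, and the joint probabilities $p(C_j \alpha_i)$ are estimated by simple counting over the same sample that is being scored.

```python
from collections import defaultdict
from itertools import combinations

def correct_fraction(docs, words):
    """Estimate G_o = sum over outcomes of max_j p(C_j, outcome).

    docs:  list of (category, set_of_words) pairs used as the sample.
    words: tuple of clue words; an outcome is their presence/absence pattern.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for category, present in docs:
        outcome = tuple(w in present for w in words)
        counts[outcome][category] += 1
    return sum(max(by_cat.values()) for by_cat in counts.values()) / len(docs)

def best_word_set(docs, vocabulary, size):
    """Examine every word set of the given size and keep the best one."""
    return max(combinations(sorted(vocabulary), size),
               key=lambda ws: correct_fraction(docs, ws))

# Hypothetical sample of six classified documents.
docs = [
    ("A", {"radar", "pulse"}),
    ("A", {"radar"}),
    ("B", {"index", "file"}),
    ("B", {"index"}),
    ("C", {"file", "pulse"}),
    ("C", {"file"}),
]
vocabulary = {"radar", "pulse", "index", "file"}

best = best_word_set(docs, vocabulary, 2)
print("no clue words: ", correct_fraction(docs, ()))
print("'radar' only:  ", correct_fraction(docs, ("radar",)))
print("best pair:     ", best, correct_fraction(docs, best))
```

The number of word sets grows combinatorially with the vocabulary, which is the practical difficulty the text points out; the sketch is feasible only because the vocabulary has four words.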

4.3.2.3 Departures from the Ideal Procedure - In the previous
analysis we have assumed that all of the conditional statistics were
obtainable. This may not be the case, however. Only a partial
set of conditional probabilities may be obtainable, or it may not be
practical to obtain them. It may also be impractical to perform the
complete set of experiments on all documents.

4.3.2.3.1 Simplifying the Choice Rule - Consider the following
example, in which documents having $W_1$ are not tested for $W_2$ (see Figure 4).
Then the decision procedure would be similar to the first case, and we
would obtain the total fraction of correctly classified documents for the
optimal procedure, $G_{2o}$:

$$G_{2o} = \max_j p(C_j W_1) + \max_j p(C_j \bar{W}_1 W_2) + \max_j p(C_j \bar{W}_1 \bar{W}_2) \tag{74}$$

The degradation in result caused by not testing all documents
for $W_2$ is $G_o - G_{2o}$:

$$G_o - G_{2o} = \max_j p(C_j W_1 W_2) + \max_j p(C_j W_1 \bar{W}_2) - \max_j p(C_j W_1) \tag{75}$$

[Figure 4: decision tree in which documents containing $W_1$ are classified
from $\{p(C_j \mid W_1)\}$ without being tested for $W_2$.]

FIGURE 4. A Second Procedure for Classification

The maximum of $G_{2o}$ would be found by trying all possible word
combinations as in the first case. If the reason for not performing the $W_2$
test is the cost of the test, then certainly $W_1$ should be chosen such
that $p(W_1)$ is large and fewer documents would need two tests. This
consideration can be introduced into the equation by including a testing
cost factor in $G_{2o}$.

4.3.2.3.2 Lack of Information - Consider the situation in which
the only sets of probabilities available are $\{p(C_j)\}$, $\{p(C_j \mid W_1)\}$, and
$\{p(C_j \mid W_2)\}$. Also assume that $W_1$ will be tested for first and only those
documents not having $W_1$ will be tested for $W_2$. This procedure is shown
in Figure 5. The fraction of correctly classified documents if categories
$a_1$, $a_2$, and $a_3$ are chosen for the respective groupings is:

$$G_3 = p(a_1 W_1) + p(a_2 \bar{W}_1 W_2) + p(a_3 \bar{W}_1 \bar{W}_2) \tag{76}$$

[Figure 5: decision tree for the third procedure; $e_2$ = "Is $W_2$ present?"]

FIGURE 5. A Third Procedure for Classification

It would seem reasonable to choose

$$\begin{aligned}
&a_1 \text{ on the basis of } p(a_1 \mid W_1) = \max_j p(C_j \mid W_1); \\
&a_2 \text{ on the basis of } p(a_2 \mid W_2) = \max_j p(C_j \mid W_2); \text{ and} \\
&a_3 \text{ on the basis of } p(a_3) = \max_j p(C_j),
\end{aligned} \tag{77}$$

because this method seems to make the best use of the available
information. Then to calculate the maximum value of $G_{3o}$ over the entire set of
clue word combinations would seem practically quite difficult, although
conceptually it appears easy.


4.3.2.4 Summary - Some game-theoretical considerations of the
classification problem have been presented. It seems that any theoretical
analysis of the "non-ideal" case is extremely limited; however, this
should be investigated further. It is not yet clear how the information-theoretical
and game-theoretical approaches are related; but if there is
a simple relationship in the ideal case, it may shed light on a combined
approach for the non-ideal cases.

4.3.3 An Approach to a Criterion for Automatically Generated Extracts -
Automatic extracting was originally described by Luhn [46] some time ago.
While he refers to the end products of his process as abstracts, they are
more accurately characterized as extracts of what are hopefully the more
central, critical, or descriptive sentences in a document. Luhn's
technique is purely statistical. Sentences are selected for extracting on
the basis of two related facts about their word content:

(a) The relative frequency of the words in the sentence, except for
common words.

(b) The distance between high frequency words in the sentence, based
upon the number of intervening non-clue words.
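The two selection criteria above can be illustrated with a simplified scoring sketch. It is not presented as Luhn's exact procedure: the common-word list, the frequency threshold `min_count`, the gap limit `max_gap`, the cluster score (squared number of clue words divided by cluster length), and all function names are assumptions chosen for the illustration.

```python
import re
from collections import Counter

# Assumed stop list; a real system would use a much longer one.
COMMON = {"the", "a", "an", "of", "in", "is", "to", "and", "for", "on", "it", "are"}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def sentence_score(tokens, clue_words, max_gap=4):
    """Best cluster score: (clue words in cluster)^2 / cluster length.

    A cluster starts and ends on a clue word and contains no run of more
    than max_gap consecutive non-clue words.
    """
    positions = [i for i, t in enumerate(tokens) if t in clue_words]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1                      # still inside the same cluster
        else:
            best = max(best, count * count / (prev - start + 1))
            start, count = pos, 1           # begin a new cluster
        prev = pos
    return max(best, count * count / (prev - start + 1))

def extract(text, n_sentences=1, min_count=2):
    """Return the n highest-scoring sentences of the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(t for t in tokenize(text) if t not in COMMON)
    clue_words = {w for w, c in freq.items() if c >= min_count}
    ranked = sorted(sentences,
                    key=lambda s: sentence_score(tokenize(s), clue_words),
                    reverse=True)
    return ranked[:n_sentences]

text = ("Retrieval of documents is costly. "
        "A retrieval system matches stored documents to queries. "
        "The weather was pleasant.")
print(extract(text, 1))
```

In this toy text the clue words are "retrieval" and "documents"; the first sentence wins because its two clue words are separated by a single common word, while the sentence with no clue words scores zero.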

While Luhn presents a rather vague theoretical rationale for the
validity of such an approach, there is no attempt to justify it in detail,
except on the grounds that it can produce useful extracts. No attempt is
made to show whether extracts generated by any other technique are more
or less useful. Recently Giuliano et al [22] at Arthur D. Little have
proposed a technique for incorporating syntactic information into the
distance measure in order to make the technique more useful.

There seem to be two things lacking in this approach to automatic
abstracting or extracting:

(a)  Any criterion, or perhaps multiple criteria depending on the
context in which the extract is to be used, for determining the
adequacy of any given extract or extracting scheme.

(b)  An understanding of the fundamental processes involved in
human abstracting, extracting, condensation, or perception of
statement saliency in a longer argument or presentation.

It would seem that a combination of the approach of Newell and Simon
[53] to the simulation of cognitive processes (theorem proving and
problem solving more generally) and the approach of Maron [4?] to the
automatic classification of documents might be appropriate. While each of
these studies is well known, it might be appropriate to indicate briefly
which aspects of their methodology are relevant to alleviating the two
shortcomings in present automatic extracting systems.

Newell et al., in order to simulate cognitive functioning, first used
a method of observation and introspection to gain insight into the method
by which humans proved logic theorems. In the context of information
retrieval the major emphasis is on useful extraction rather than on the
simulation of human extraction. It may nevertheless pay to observe human
extracting behavior in order to develop more useful algorithms for
obtaining automatic extracts.

The work of Maron and Kuhns has already been described in previous
reports. It involved the use of human classification of a set of items
as a criterion for automatic classification. The automatic
classification, however, was not based on the unknown techniques of the
human classifiers. The automatic algorithm was based rather upon purely
statistical features of some of the classified documents. Human
classification was also available, however, to provide the criterion for
checking the adequacy of the automatic algorithm once it was derived.


In the case of automatic extracting both of these techniques might
prove useful. That is, the use of observation and introspection would
help alleviate the difficulty caused by the lack of understanding of
human functions and allow for the development of more rational
extracting algorithms. Perhaps these techniques could be ultimately
extended to abstracting per se. The records of humanly generated extracts
could be used as a criterion for evaluating the adequacy of various
automatic algorithms. The latter would alleviate the difficulty caused by
the non-existence of suitable criteria.


The paradigm for such research and development would be as follows:

(a)  A series of documents, either large texts or shorter articles
for research convenience, would be selected for extracting.

(b)  Ground rules for desired extracts would be developed, e.g.:

     (1)  How long should each extract be? Should it be some fixed
     proportion of the total document?

     (2)  What sentential units should be extracted? Whole sentences
     only? Parts of sentences? Parts that can be recombined
     to form larger sentences?

     (3)  What is the focal purpose of the extract? To extract as
     much factual information as possible within the limits
     imposed by the length of an extract? To characterize the
     document as well as possible in order that the reader
     might know what information it contains? Both of these?

     (4)  What information or techniques may be used in generating
     the extract? Anything that occurs to the user based upon
     his total knowledge? Anything based on the explicit and
     implicit content of the document? Only explicit content?
     Only rigorously formulated rules?

(c)  The documents would then be subjected to human extracting using
instructions based upon the ground rules.

(d)  A portion of the humanly extracted documents would be carefully
subjected to introspective report and an analysis of the implicit
rules followed in extracting.

(e)  Based on this analysis, one or several automatic algorithms
would be developed for achieving essentially the same extracts
from readily treated information in the documents. For the sake
of generality, an attempt would also be made to incorporate those
rules manifest in introspective protocols that could be handled
by computers.

(f)  Measures of correspondence between humanly and automatically
generated extracts would then be developed.

(g)  Finally, the automated techniques would be applied to the
remaining documents in the sample and the extracts generated would be
validated against the criterion of the human extracts already
available.
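One possible form for the measures of correspondence mentioned in the
paradigm is sketched below. It assumes, purely for illustration, that
each extract is recorded as the set of positions of the sentences
selected from the document. The three ratios and the function name
correspondence are hypothetical choices, not measures proposed in this
report.

```python
def correspondence(human, automatic):
    """Compare two extracts given as collections of sentence positions."""
    human, automatic = set(human), set(automatic)
    shared = human & automatic
    union = human | automatic
    return {
        # share of the human extract recovered by the algorithm
        "recall": len(shared) / len(human) if human else 0.0,
        # share of the automatic extract also chosen by the human
        "precision": len(shared) / len(automatic) if automatic else 0.0,
        # overlap relative to everything selected by either
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

# Hypothetical example: sentence positions chosen from one document.
scores = correspondence(human={0, 3, 7, 9}, automatic={0, 3, 4})
```

Averaging such scores over the validation documents would give a single
figure for each candidate algorithm.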


While this approach depends upon research and development strategies
already developed by others, its application to the information retrieval
problem is unique. Further research along these lines appears warranted.

4.4  FILE STRUCTURE

File structure is concerned with the organization of document
descriptions in a storage medium. The assumption has generally been that
the storage medium is attached to a computer, though much of the work can
be applied more generally. With every file organization there is
associated an algorithm for obtaining the addresses of those documents
that satisfy a given description. The file structure depends, of course,
on the descriptive structure to be used. File structure is concerned only
peripherally with the method by which descriptions are assigned to
documents.


In this section, two topics will be presented. The first and major
topic is a mathematical analysis and discussion of the efficiency of
certain types of file organizations. The second topic is a description
and evaluation of the Multi-List system.

4.4.1  Comparative Analysis of Some File Organizations

4.4.1.1  Introduction - This section contains a discussion of
a number of file organizations that may be suitable for the retrieval of
documents or other items of information. The exposition largely follows
the order of mathematical development rather than some didactic
organization for easily communicating the results. This method of
exposition is used because it is impossible in work of this kind to know
at the beginning where fruitful mathematical analysis will lead.

For each file structure considered, expressions are derived for
the average or expected values of the number of items and the subject or
category headings examined to retrieve a single item, known to be in the
file, in response to a request. The file organizations are then compared
and evaluated in terms of these expected values for a wide range of file
sizes. To aid in the comparison, variances are derived and plotted.

Three different types of file organizations or structures will
be compared. They are:

(a)  Single-level  subject  headings. 

(b)  Hierarchical  trees  of  items. 

(c)  Hierarchical  trees  of  subject  headings. 

The first type consists of a single level of unrelated subject headings
or category names under which items are grouped or filed. Both the order
of subject headings within the file and the order of items within a
subject are random.

The second type of file organization is a multi-level tree of
items. The connectivity of the tree does not necessarily imply a
corresponding logical relation among the items.

The tree of subject headings, on the other hand, is a multi-level
categorization of subject headings where each heading is divided
into two or more sub-headings down to the lowest level of detail. The
tree of subject headings is intended to imply the logical relation among
them. The items may be filed in a linear sequence or in a hierarchical
tree under the last row of headings.

More than one way of searching the nodes of a tree will be used.
Further subdivisions of the three types of file organizations will be
discussed in the following detailed analysis. Trees of both items and
subject headings will be considered in various cases in the section on
hierarchical trees. First, however, single-level subject headings will
be analyzed. This analysis will include the case of a sequentially
ordered file that, when searched logarithmically, makes the transition
between single-level subject headings and hierarchical trees one of
generalizing a special case.

For each type of file structure a mathematical expression can
be derived for the expected number of headings and items searched and
examined in order to locate a single item in the file. Some simplifying
assumptions will be made to keep the mathematics relatively
uncomplicated. Similar expressions can be derived, however, under less
restrictive assumptions.

4.4.1.2  Single-Level Subject Headings - Suppose there are s
subject headings. It is assumed that the subject heading under which
the item is to be found is supplied with the request. It is further
assumed for the sake of simplicity that the items in the file are evenly
distributed under the subject headings. That is, it is equally likely
that any subject heading and any item under a subject heading will be
requested and each subject heading will have the same number of items
filed under it. The probability of searching one subject heading is:

$$ p_1 = \frac{1}{s} \tag{78} $$

The probability of searching two subject headings to find the requested
one is:

$$ p_2 = \left(1 - \frac{1}{s}\right)\frac{1}{s-1} = \frac{1}{s} \tag{79} $$

Similarly:

$$ p_i = \frac{1}{s}, \qquad i = 1, 2, \ldots, s \tag{80} $$

The expected number E_s(i) of subject headings searched is:

$$ E_s(i) = \sum_{i=1}^{s} i\,p_i \tag{81} $$

or

$$ E_s(i) = \frac{s+1}{2} \tag{82} $$

The number of items N_s under each subject heading is:

$$ N_s = \frac{N}{s} \tag{83} $$

By an argument analogous to that for subject headings, the expected number
E_N(i) of items searched is:

$$ E_N(i) = \frac{N_s + 1}{2} \tag{84} $$

The expected number of items and subject headings searched in a
linear file is then:

$$ E = \frac{s+1}{2} + \frac{N_s+1}{2} = \frac{1}{2}\left(s + \frac{N}{s} + 2\right) \tag{85} $$
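The expected value for the single-level file can be checked by direct
enumeration. The following Python sketch assumes that Equation (85) has
the form E = (1/2)(s + N/s + 2) and that N is an exact multiple of s; the
function names are illustrative only.

```python
from fractions import Fraction

def expected_examined_closed_form(N, s):
    """Closed form assumed for Equation (85): E = (s + N/s + 2) / 2."""
    return Fraction(1, 2) * (s + Fraction(N, s) + 2)

def expected_examined_enumerated(N, s):
    """Average, over every equally likely (heading, item) request, of the
    number of headings plus the number of items examined sequentially."""
    assert N % s == 0, "items are assumed evenly distributed"
    per_heading = N // s
    total = 0
    for heading_pos in range(1, s + 1):            # headings examined
        for item_pos in range(1, per_heading + 1):  # items examined
            total += heading_pos + item_pos
    return Fraction(total, N)

closed = expected_examined_closed_form(100, 10)
enumerated = expected_examined_enumerated(100, 10)
```

For a file of 100 items under 10 headings both computations give an
expected 11 headings and items examined.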


A file of items arranged sequentially by some ordering rule
(e.g., a file of part or drawing numbers or any other numbered or ordered
items) can be arranged and searched by the method of subject headings
previously described. Another method of search is the following: Go to
the middle of the file. Compare the item requested with the item there.
A decision can then be made on the basis of the ordering of the items as
to whether the item sought is in the first (lower) half of the file or
in the second (higher) half. Whichever half it is in, go to the middle
of that half and repeat the procedure. This process is continued until
the item is located. The process of going to the middle of any portion
of the file will be called a cut. Since a single file item is examined
for each cut, the expected number of cuts is equal to the expected
number of file items which will be examined. This method is called the
Binary Logarithmic search.

Consider a file of N items. By the search procedure just
described, the number of items that can possibly be retrieved on the
first cut is 1; on the second cut, 2; and, in general, on the jth cut:

$$ N_j = 2^{j-1} \tag{86} $$

The maximum number of cuts n required to retrieve any item whatsoever in
the file can be determined from Equation (86) as follows:

$$ N = \sum_{j=1}^{n} N_j = \sum_{j=1}^{n} 2^{j-1} = 2^n - 1 \tag{87} $$

Solving for n gives:

$$ n = \log_2(N + 1) \tag{88} $$

The origin of the name logarithmic search is obvious from Equation (88).

It is evident from Equation (86) that the probability of
retrieving the correct item in response to a given random request on the
jth cut is:

$$ p_j = \frac{2^{j-1}}{N} \tag{89} $$

The expression for the expected number of cuts j (or, equivalently, the
number of items examined) is:

$$ E = \sum_{j=1}^{n} j\,\frac{2^{j-1}}{N} \tag{90} $$

where n is obtained from Equation (88). The series in Equation (90) is
the derivative of a geometric progression, and the expression for its
sum can be obtained by differentiating the expression for the sum of a
geometric progression with a finite number of terms. This procedure
yields the following expression for E:

$$ E = \left[\frac{N+1}{N}\right]\log_2(N + 1) - 1 \tag{91} $$
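The binary logarithmic search and the expected number of cuts can be
illustrated together. The Python sketch below assumes a sorted list as
the file and assumes that Equation (91) reads
E = [(N + 1)/N] log2(N + 1) - 1; the names cuts_to_find and
expected_cuts are hypothetical.

```python
import math

def cuts_to_find(items, target):
    """Binary logarithmic search over a sorted list.

    Returns the number of cuts (items examined) needed to locate target.
    """
    lo, hi = 0, len(items) - 1
    cuts = 0
    while lo <= hi:
        mid = (lo + hi) // 2      # go to the middle of the remaining portion
        cuts += 1
        if items[mid] == target:
            return cuts
        if items[mid] < target:
            lo = mid + 1          # item sought is in the higher half
        else:
            hi = mid - 1          # item sought is in the lower half
    raise KeyError(target)

def expected_cuts(N):
    """Closed form assumed for Equation (91); exact when N = 2**n - 1."""
    return (N + 1) / N * math.log2(N + 1) - 1

file_items = list(range(15))                  # N = 2**4 - 1
cuts = [cuts_to_find(file_items, x) for x in file_items]
average_cuts = sum(cuts) / len(cuts)
```

For N = 15 one item is found on the first cut, two on the second, four
on the third, and eight on the fourth, so the average agrees with the
closed form.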

4.4.1.3  Hierarchical Trees - Only regular rooted trees will be
considered for hierarchical trees. A tree is rooted if all its branches
are connected ultimately to a single node (the root). A tree is regular
if the number of branches k emanating from each node is a constant.
Another way of thinking of this file structure is that every heading or
grouping of the file organization is divided into the same number of
subheadings.

Four cases of retrieving items from trees will be considered.
These cases are designated I to IV, respectively.

4.4.1.3.1  Case I - In this case the tree is considered as a
hierarchy composed entirely of file items, each of which is equally
likely to be the answer to a given random request. Hence, retrieving
a given node will be considered as providing a single-item response.
The level of the node then represents the generality of the response,
which is presumably related directly to the generality of the request.
The node provided as a response can be considered as the name or term
or descriptor for all the nodes at lower levels of the tree that are
connected to the node provided as a response. If the node is a category
name, all the connected nodes (the items in the category) could be
provided as part of the response. It is assumed that the tree is indexed;
that is, each node of the tree contains indexes of the nodes on the next
lower level connected to it. It is also assumed that these indexes are
sufficient to ascertain which node to examine at the next level. Thus
only one node is examined at each level searched.

If each node of the tree contains indexes that are identifiers
of the nodes at the next level at the end of the branches emanating from
it, then by examining a given node a decision can be made as to which
node to examine at the next level. Searching a tree of this type is a
generalization of the binary logarithmic search. For example, consider
a regular binary tree; that is, k = 2. Examining the first node, the
root, is analogous to going to the middle of the file. There are two
nodes at the next level. Selecting one is analogous to going to the
middle of the lower half of the file; selecting the other is equivalent
to going to the middle of the upper half of the file. The generalization
of this process for larger integral values of k is obvious. The
mathematics is analogous to the binary logarithmic search.

The number of levels L to be examined in order to guarantee the
retrieval of any item in a regular tree of order k is:

$$ L = \log_k[(k - 1)N + 1] \tag{92} $$

The expected number of items examined becomes:

$$ E = \sum_{j=1}^{L} j\,\frac{k^{j-1}}{N}
     = \left[\frac{(k-1)N + 1}{(k-1)N}\right] L - \frac{1}{k-1} \tag{93} $$

where L is determined from Equation (92). Thus Equations (88) and (91)
are merely special cases of Equations (92) and (93), respectively, for
regular binary trees.
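The closed form for the indexed tree can be compared with a direct count
over the levels of a full regular tree. The Python sketch below assumes
that Equation (93) reads E = {[(k-1)N + 1]/[(k-1)N]} L - 1/(k-1) and that
level j holds k^(j-1) equally likely items, each reached after examining
j nodes; all function names are illustrative.

```python
import math
from fractions import Fraction

def levels_needed(N, k):
    """Equation (92): levels of a full regular tree of order k holding N items."""
    return math.log((k - 1) * N + 1, k)

def expected_examined_tree(L, k):
    """Count nodes examined for every item in a full tree with L levels."""
    N = (k**L - 1) // (k - 1)
    total = sum(j * k**(j - 1) for j in range(1, L + 1))
    return N, Fraction(total, N)

def closed_form(N, L, k):
    """Closed form assumed for Equation (93)."""
    return Fraction((k - 1) * N + 1, (k - 1) * N) * L - Fraction(1, k - 1)

N, enumerated = expected_examined_tree(L=4, k=3)   # ternary tree, 40 items
```

With k = 2 the same functions reproduce the binary logarithmic search
result for N = 15.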

4.4.1.3.2  Case II - In this case only the nodes at the bottom
level of the tree represent file items. It is assumed that each such
node represents a group of file items. Thus a search consists of tracing
a path through the tree to one node at the bottom and searching the items
filed under that node to provide a single file item as a response. Again,
it is assumed that each node is equally likely to be the answer. If this
case is restricted to regular trees with no method of indexing or
determining which connected node at the next level is the correct one,
then this case generalizes the simple subject heading file to a
multi-level subject heading or classification file. Only non-indexed
trees will be considered in this case. A non-indexed tree is one that has
no mechanism for selecting the proper node at the next lower level
without examining the nodes at that level connected to the node at which
the searcher is presently located.

Assume there are s nodes or subject headings on a regular tree
of order k. Then let there be N file items listed under the bottom nodes
and assume that the file items are evenly distributed among these nodes.
Assume also that there are L levels of nodes in the tree.

Since the only nodes searched at each level are those connected
to the node selected at the next higher level, the probability p_i of
finding the desired subject heading at a given node is:

$$ p_i = \frac{1}{k} \tag{94} $$

Therefore, the expected number of nodes examined at any level j, except
the first level or the root node where the expected number is 1,* is:

$$ E_j = \frac{k + 1}{2} \tag{95} $$

where 2 ≤ j ≤ L. Hence, the expected number of nodes examined for the
entire tree including the root node is:

$$ E_s = \left[\frac{k+1}{2}\right](L - 1) + 1 \tag{96} $$

The required number of levels L in the tree is determined by k and s, and
is obtained from Equation (92), which gives:

$$ L = \log_k[(k - 1)s + 1] \tag{97} $$

and, by substituting and simplifying:

$$ E_s = \left[\frac{k+1}{2}\right]\log_k[(k - 1)s + 1] + \frac{1-k}{2} \tag{98} $$


At this stage, no file items have been examined. Equation (98)
gives the expected number of subject headings examined to find the
heading at the lowest level under which the file item sought is listed.
Therefore, the file items under that heading must now be examined. The
number of items N_s filed under a given subject heading is:

$$ N_s = \frac{N}{s_L} \tag{99} $$

where s_L is the number of subject headings, or nodes, at the lowest level
of the tree. This sequence is a simple linear file like the first one
examined. The expected number of file items searched E_N is then:

$$ E_N(i) = \sum_{i=1}^{N_s} \frac{i}{N_s} = \frac{N_s + 1}{2} \tag{100} $$

The number of nodes s_j at level j of a regular tree of order k is given
by:

$$ s_j = k^{j-1} \tag{101} $$

therefore:

$$ s_L = k^{L-1} \tag{102} $$

Substituting Equation (97) into Equation (102) yields:

$$ s_L = \frac{(k - 1)s + 1}{k} \tag{103} $$

and, from Equations (99) and (103):

$$ N_s = \frac{kN}{(k - 1)s + 1} \tag{104} $$

Substituting Equation (104) in Equation (100) gives:

$$ E_N = \frac{kN + (k - 1)s + 1}{2[(k - 1)s + 1]} \tag{105} $$

* It is assumed that the root node is examined to identify the tree and
locate the nodes at the second level.


The expected value of the number of subject headings and file
items examined to retrieve one file item in this type of file
organization is Equation (98) plus Equation (105):

$$ E = \frac{kN + (k - 1)s + 1}{2[(k - 1)s + 1]}
     + \left[\frac{k+1}{2}\right]\log_k[(k - 1)s + 1] + \frac{1-k}{2} \tag{106} $$
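The expected value for the non-indexed tree of subject headings with
sequentially stored items can be checked by enumeration. The Python
sketch below assumes that Equation (106) is the sum of Equations (98) and
(105) in the form shown in the docstring, and that at every level below
the root the k connected headings are examined in order until the right
one is found. Function names and the example sizes are illustrative.

```python
import math
from fractions import Fraction
from itertools import product

def case2_closed_form(N, s, k):
    """Form assumed for Equation (106), with u = (k - 1)s + 1:
    E = (kN + u) / (2u) + ((k + 1) / 2) log_k(u) + (1 - k) / 2."""
    u = (k - 1) * s + 1
    return (k * N + u) / (2 * u) + (k + 1) / 2 * math.log(u, k) + (1 - k) / 2

def case2_enumerated(L, k, items_per_leaf):
    """Average nodes and items examined over all equally likely requests."""
    total = 0
    count = 0
    # path[j] is the position of the correct heading among its k siblings
    for path in product(range(1, k + 1), repeat=L - 1):
        for item_pos in range(1, items_per_leaf + 1):
            total += 1 + sum(path) + item_pos   # root + headings + items
            count += 1
    return Fraction(total, count)

# Ternary tree with 3 levels: s = 13 headings, 9 bottom headings,
# 4 items under each, hence N = 36.
closed = case2_closed_form(N=36, s=13, k=3)
enumerated = case2_enumerated(L=3, k=3, items_per_leaf=4)
```

Both computations give 7.5 headings and items examined for this small
file.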


It is now evident that when file items are related it may be
possible to arrange each set of N_s items so that it can be searched
logarithmically. In this case Equation (106) becomes:

$$ E = \left[\frac{(k-1)N_s + 1}{(k-1)N_s}\right]\log_k[(k - 1)N_s + 1] - \frac{1}{k-1}
     + \left[\frac{k+1}{2}\right]\log_k[(k - 1)s + 1] + \frac{1-k}{2} \tag{107} $$

This equation is obtained from Equations (93), (98), and (99). Equation
(103) was used to obtain the value of s_L.


4.4.1.3.3  Case III - This case is the same as Case I except
that the tree is not indexed. That is, any node may be a satisfactory
response to a request; but after selecting a node at a given level, it
is necessary to examine the nodes at the next lower level connected to
the selected node in order to ascertain which one is the next appropriate
subheading.

In this case the maximum number of nodes examined at each level
except the first is simply k. The number of nodes examined at the first
level is 1. Therefore, the maximum number of nodes examined in any search
is:

$$ n = k(L - 1) + 1 \tag{108} $$

hence, from Equations (92) and (108):

$$ n = k\log_k[(k - 1)N + 1] + 1 - k \tag{109} $$

Therefore, the expected number of nodes examined is:

$$ E = \frac{n + 1}{2} = \frac{k}{2}\log_k[(k - 1)N + 1] + \frac{2-k}{2} \tag{110} $$

where n is determined from Equation (109).

4.4.1.3.4  Case IV - This case considers an indexed tree of
subject headings rather than file items with the file items located under
the lowest row of nodes or subject headings. The equally likely
assumption is involved, as usual. Two variations can be considered.
First, the file items are sequential and searched in order. Second, the
file items are searched logarithmically; in this variation the items are
actually filed in a tree structure.

Since the subject headings in this case are not responses, the
expected number of headings examined is fixed and equal to the number of
levels L in the tree. Therefore, from Equation (97):

$$ E_s = \log_k[(k - 1)s + 1] \tag{111} $$

For a sequentially searched file, the expected number of items searched
is obtained from Equation (105). Therefore, the expected number of
subject headings and items searched is:

$$ E = \frac{kN + (k - 1)s + 1}{2[(k - 1)s + 1]} + \log_k[(k - 1)s + 1] \tag{112} $$

If the items are searched logarithmically, the expected number
is obtained by taking N equal to N_s and then substituting Equation (104)
in Equation (93). The resulting equation is:

$$ E_N = \frac{k(k-1)N + (k-1)s + 1}{k(k-1)N}\,
         \log_k\!\left[\frac{k(k-1)N + (k-1)s + 1}{(k-1)s + 1}\right] - \frac{1}{k-1} \tag{113} $$

Therefore, the expected number of subject headings and items examined
is Equation (111) plus Equation (113):

$$ E = \log_k[(k - 1)s + 1]
     + \frac{k(k-1)N + (k-1)s + 1}{k(k-1)N}\,
       \log_k\!\left[\frac{k(k-1)N + (k-1)s + 1}{(k-1)s + 1}\right] - \frac{1}{k-1} \tag{114} $$


4.4.1.4  Analysis and Comparison of the Expected Values - The
major purpose of deriving expressions for the expected values of the
number of headings and items examined in various file structures is that
these values provide a convenient (if oversimplified) means of comparing
the effectiveness of different file structures. These file organizations
and their corresponding average values are summarized in Table 1.

For general purposes of comparison the equations identified in
Table 1 can be rewritten in simpler form. The simplified versions are
given below with their original numbers followed by "A". The subscript
s stands for subject headings; N for file items. For a file with
single-level subject headings, and no other structure,

$$ E = \frac{1}{2}\left[s + \frac{N}{s} + 2\right] = \frac{s + 1}{2} + \frac{N_s + 1}{2} \tag{85A} $$

where N_s is obtained from Equation (83).

For an indexed tree of items (Case I),

$$ E = L_N - \frac{1}{k-1} \tag{93A} $$

where L_N = L is obtained from Equation (92), and the term L_N/[(k-1)N]
of Equation (93), which is small for large N, is neglected.

For a non-indexed tree of subject headings with items stored
sequentially (Case II-A),

$$ E = \left[\frac{k+1}{2}\right](L_s - 1) + 1 + \frac{N_s + 1}{2} \tag{106A} $$

where L_s is obtained from Equation (97), and N_s, from Equation (104).

For a non-indexed tree of subject headings with items stored
in an indexed tree (Case II-B),

$$ E = \left[\frac{k+1}{2}\right](L_s - 1) + 1 + L_N - \frac{1}{k-1} \tag{107A} $$

where L_s and L_N are obtained from Equation (97) (with N_s in place of s
for L_N), and N_s, from Equation (104).

For a non-indexed tree of items (Case III),

$$ E = \frac{k}{2}(L_N - 1) + 1 \tag{110A} $$

where L_N = L is obtained from Equation (92).

For an indexed tree of subject headings with items stored
sequentially (Case IV-A),

$$ E = L_s + \frac{N_s + 1}{2} \tag{112A} $$

where L_s is obtained from Equation (97), and N_s, from Equation (104).




For an indexed tree of subject headings with items stored in an
indexed tree (Case IV-B),

$$ E = L_s + L_N - \frac{1}{k-1} \tag{114A} $$

where L_s and L_N are obtained from Equation (97) (with N_s in place of s
for L_N), and N_s, from Equation (104).

These equations can be analyzed in two major ways with respect
to E. The first is to ascertain within a given equation whether there
is a relationship between s and N that will minimize E for that type of
file organization. The second is to compare the equations with each
other to determine whether some file structures are always superior to
others.

To carry out the first analysis it is sufficient to assume that
s can take any positive real value, to differentiate each of the
equations with respect to s, considering N as a constant, and to check
whether the resulting extremum is indeed a minimum. If there is such a
relationship between s and N, it provides the proper number of subject
headings s to minimize E for a file of N items with that type of
organization.*


* In the following discussion the values of s, which optimize the expected
number of headings and items examined, are obtained for several of the
file organizations. This derivation is accomplished by differentiating
the expression for E with respect to s to obtain the appropriate s as a
function of N that minimizes E. Strictly speaking, such a procedure is
not permissible because all the distributions considered are discrete.
E is defined only for positive integral values of s and N. Nevertheless,
the equations for E in all cases are continuous functions for the domains
of k, s, and N that are of interest. Consequently, these differentiations
can be carried out formally and the relative minima obtained. To
obtain the integral values of s that minimize E, it is then necessary to
substitute the two integers closest to the minimum s into the equation
for E to ascertain which gives the smaller E. This integer is then used
as the minimum, provided it is positive. Even this procedure would not
be sufficient were it not for the fact that these functions, in the cases
considered, have only one relative minimum, and, therefore, this relative
minimum is also an absolute minimum. The ultimate justification for these
unrigorous techniques is that they do provide the real minima and,
therefore, have considerable utility.

For example, taking the partial derivative of E with respect
to s in Equation (85A) and setting the result equal to zero yields:

$$ s = \sqrt{N} \tag{115} $$

for a file with single-level subject headings only. A check reveals
that the appropriate conditions for a minimum are satisfied. That is,
the value of s given in Equation (115) will always result in a minimum
E for that N. Substituting Equation (115) in Equation (85A) gives:

$$ E_{\min} = \sqrt{N} + 1 \tag{116} $$

From Equations (83) and (115), the optimum number of items under the
subject headings is:

$$ N_s = \sqrt{N} \tag{117} $$

Equation (93A) for Case I cannot be treated in this manner
because it is a function of N only (and k). However, as k increases,
E decreases for a constant N. This fact must be interpreted carefully
because no two arbitrarily selected values of k will necessarily yield
an integral value of L for a fixed N.
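The procedure for the single-level file, including the substitution of
the two integers closest to the real minimum, can be sketched in Python
as follows. The sketch assumes that Equation (85A) reads
E = (1/2)(s + N/s + 2), so that the real minimum lies at s = sqrt(N); the
function names are illustrative.

```python
import math

def expected_examined(N, s):
    """Form assumed for Equation (85A)."""
    return 0.5 * (s + N / s + 2)

def best_integer_headings(N):
    """Real minimum at s = sqrt(N); try the two nearest positive integers."""
    s_real = math.sqrt(N)
    candidates = {max(1, math.floor(s_real)), max(1, math.ceil(s_real))}
    return min(candidates, key=lambda s: expected_examined(N, s))

best_10000 = best_integer_headings(10_000)   # sqrt(N) is already integral
best_1000 = best_integer_headings(1_000)     # sqrt(N) lies between 31 and 32
```

For N = 10,000 the optimum is 100 headings with E = 101, that is,
sqrt(N) + 1. For N = 1,000 the integer optimum is 32 headings.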
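Application of the same method to Equation (106A) for Case II-A
yields:

$$ s = \frac{1}{k-1}\left[\frac{kN}{(k+1)\log_k e} - 1\right] \tag{118} $$

This value of s for any N will yield the minimum E, and the value of E
is:

$$ E_{\min} = \left[\frac{k+1}{2}\right]\left\{\log_k\!\left[\frac{kN}{(k+1)\log_k e}\right] + \log_k e\right\} + \frac{2-k}{2} \tag{119} $$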


Equation  (107A)  for  Case  XI-B  has  no  relative  minimum.  However, 
the  optimum  value  for  s  can  be  obtained  by  observation.  By  substituting 
Equations  (97)  and  (104)  in  (107A)  and  simplifying,  the  result  is : 

E  -  [logk[(k  -  l)s  ♦  1]  -  I): 

+  logjc[k(k  -  1)N  +  (k  »  l)s  +  1)  «  k  i j  ^  (107B) 

This  equation  is  defined  for  s  4  1.  For  chis  range  'of  s,  Equation  (107B) 
has  a  minimum  at  s  “1.  This  minimum  gives  for  Ei 

E  "  1  +  h  ~ 

The  single  subject  heading  is  superfluous  and  can  be  eliminated.  The 
minimum  E  thus  becomes: 

Via  1  %  -  FTT 

Therefore,  the  optimum  s  for  Equation  (107A)  is  zero,  and  the  equation 
has  been  reduced  to  Equation  (93A).  Consequently,  it  is  disadvantageous 

to  superimpose  a  non-indexed  tree  of  subject  headings  on  an  indexed  tree 

of  file  items. 
Equation (110A) for Case III is also a function of N and k only.
In this case a minimum E cannot be easily derived analytically. But
solving Equation (110) numerically indicates that E is a minimum when
k = 2 for N < 100 and when k = 3 for N ≥ 500.
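The numerical search described above can be sketched in a few lines. This is only an illustration under a stated assumption: Equation (110A) is not reproduced in this part of the report, so the sketch assumes the form E = (k/2)(L_N - 1) + 1 with L_N = log_k[(k - 1)N + 1]. The function names are hypothetical.

```python
import math


def levels(n_items: int, k: int) -> float:
    """L_N = log_k[(k - 1)N + 1]: levels of a regular tree of order k with N items."""
    return math.log((k - 1) * n_items + 1, k)


def expected_nonindexed_items(n_items: int, k: int) -> float:
    """ASSUMED form of Equation (110A): E = (k/2)(L_N - 1) + 1."""
    return (k / 2) * (levels(n_items, k) - 1) + 1


def best_order(n_items: int, orders=range(2, 11)) -> int:
    """Return the tree order k (from `orders`) that gives the smallest E."""
    return min(orders, key=lambda k: expected_nonindexed_items(n_items, k))


if __name__ == "__main__":
    for n in (50, 99, 500, 1000):
        print(n, best_order(n))
```

Under this assumed form the search selects k = 2 for small files and k = 3 for files of several hundred items or more, which agrees with the statement in the text.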

For Equation (112A) (Case IV-A) the s that gives minimum E is:

    s = \frac{1}{k-1}\left[\frac{kN}{2\log_k e} - 1\right]              (121)

The minimum E is:

    E_{min} = \log_k\left[\frac{eN}{2\log_k e}\right] + \frac{3}{2}     (122)


Equation (114A) for Case IV-B has no relative minimum. However,
the optimum value for s can be obtained as follows. By substituting
Equations (94) and (97) in Equation (114A) and simplifying, it becomes:

    E = \log_k[k(k-1)N + (k-1)s + 1] - \frac{1}{k-1}                    (114B)

This equation is defined for s ≥ 1. Obviously, it has an absolute
minimum at s = 1, which gives:

    E = 1 + L_N - \frac{1}{k-1}

The single subject heading again is superfluous, and E becomes:

    E_{min} = L_N - \frac{1}{k-1}                                       (123)

Thus the optimum s for Equation (114A) is zero, and this equation is also
reduced to Equation (93A). In other words, wherever it is possible to
construct an indexed tree of items, it is pointless to superimpose an
indexed tree of subject headings upon it. It is also pointless to
establish any other system of subject headings. One example, namely
Equation (107A), has already been considered.
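The argument for Case IV-B can be checked numerically. The sketch below is a hedged illustration: it assumes that Equation (114B) reads E = log_k[k(k-1)N + (k-1)s + 1] - 1/(k-1) and that Equation (93A) reads E = L_N - 1/(k-1) with L_N = log_k[(k-1)N + 1]. The function names are hypothetical.

```python
import math


def expected_case_iv_b(n_items: int, s: int, k: int) -> float:
    """ASSUMED reading of Equation (114B) for s >= 1 headings."""
    return math.log(k * (k - 1) * n_items + (k - 1) * s + 1, k) - 1 / (k - 1)


def expected_indexed_items(n_items: int, k: int) -> float:
    """ASSUMED reading of Equation (93A): E = L_N - 1/(k - 1)."""
    return math.log((k - 1) * n_items + 1, k) - 1 / (k - 1)


if __name__ == "__main__":
    n, k = 1000, 10
    for s in (1, 10, 100):
        print(s, round(expected_case_iv_b(n, s, k), 4))
    print("no headings:", round(expected_indexed_items(n, k), 4))
```

With these forms E grows with s, and the value at s = 1 exceeds the value for the plain indexed tree of items by exactly one examination, the single superfluous heading.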

The second type of analysis compares one equation with another
for an arbitrary but specified file size N and for a number of headings
s; the objective is to determine whether E is always less in one type
of file organization than in another. Equations (107A) and (114A) have
been shown to be superfluous and will not be considered.

The files with no subject headings will be considered first.
For a given N, an indexed tree of items, Equation (93A), will yield a
lower average number of items searched than a non-indexed tree of items,
Equation (110A), if:

    L_N - \frac{1}{k-1} < \frac{k}{2}(L_N - 1) + 1

This inequality can be written:

    \left(\frac{k}{2} - 1\right)(L_N - 1) > -\frac{1}{k-1}              (124)

The inequality is clearly valid for k ≥ 2. Consequently, the average
number of items examined in searching an indexed tree of N items is
always less than the average number examined in a non-indexed tree.

Indexed and non-indexed trees with sequentially stored items
can be compared in the case where the number of headings in both trees
is the same. Equation (106A) for non-indexed trees and Equation (112A)
for indexed trees can be compared in terms of:

    \frac{k+1}{2}(L_s - 1) + 1 > L_s

or

    \left[\frac{k+1}{2}\right](L_s - 1) > L_s - 1                        (125)

This inequality is clearly valid for k ≥ 2 and L_s > 1. Therefore,
Equation (112A) gives a smaller E than Equation (106A). It is clear,
however, from Equations (118) and (121) that the optimum s's for the two
trees of Equations (112A) and (106A) are not identical. Nevertheless, it
can be shown directly from Equations (119) and (122) that Equation (112A)
also yields a smaller E than Equation (106A) when s is optimized in each
case. This optimization would require:

    \log_k\left[\frac{eN}{2\log_k e}\right]
        < \log_k\left[\frac{eN}{(k+1)\log_k e}\right]^{(k+1)/2}

or

    \frac{eN}{2\log_k e} < \left[\frac{eN}{(k+1)\log_k e}\right]^{(k+1)/2}          (126)

This inequality is valid for:

    N > \frac{1}{e\log_e k}\left[\frac{(k+1)^{(k+1)/(k-1)}}{2^{2/(k-1)}}\right]     (127)

This condition presents no restriction for a practical case. For example,
Equation (127) requires N ≥ 4, if k = 2; N ≥ 3, if k = 10; N ≥ 9, if
k = 100.
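The bound on N can be evaluated directly. The following sketch is an illustration only; it assumes that Equation (126) reads eN/(2 log_k e) < [eN/((k+1) log_k e)]^((k+1)/2) and that Equation (127) is the bound obtained by solving that inequality for N. The function names are hypothetical.

```python
import math


def bound_127(k: int) -> float:
    """ASSUMED form of Equation (127): N must exceed this value."""
    numerator = (k + 1) ** ((k + 1) / (k - 1))
    denominator = 2 ** (2 / (k - 1)) * math.e * math.log(k)
    return numerator / denominator


def inequality_126(n_items: int, k: int) -> bool:
    """ASSUMED form of Equation (126); note that eN / log_k e = eN ln k."""
    a = math.e * n_items * math.log(k)
    return a / 2 < (a / (k + 1)) ** ((k + 1) / 2)


def smallest_valid_n(k: int) -> int:
    """Smallest integer N strictly greater than the bound."""
    return math.floor(bound_127(k)) + 1


if __name__ == "__main__":
    for k in (2, 10):
        print(k, smallest_valid_n(k))
```

For k = 2 and k = 10 the sketch returns 4 and 3, so the restriction is indeed negligible for files of practical size.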


For a given N and a fixed s > 1, an indexed tree of subject
headings, Equation (112A), always gives a lower value of E than a single
level of subject headings, Equation (85A). The conditions would require:

    \log_k[(k-1)s + 1] < \frac{s+1}{2}

This inequality can be transformed by algebra to:

    k^{-(s+1)/2}[(k-1)s + 1] < 1                                         (128)

By differentiating the left member of Equation (128) with respect to k
and setting it equal to zero, a value for k can be obtained to make it
an extremum. This value is:

    k = \frac{s+1}{s}                                                    (129)

By examining the second derivative at this point, it is observed that
Equation (129) maximizes the left member of Equation (128) when s > 1.
This maximum value is:

    2\left[\frac{s}{s+1}\right]^{(s+1)/2}                                (130)

For s > 1, the value (130) is always less than 1. Since the maximum
value satisfies Equation (128), any other value, in particular any
k ≥ 2, will also satisfy it.


When s is optimized in each case, these two file structures can
be compared by Equations (116) and (122). Equation (112A) will give a
lower E than Equation (85A) in the optimum case when:

    \frac{3}{2} + \log_k\left[\frac{eN}{2\log_k e}\right] < \sqrt{N} + 1

By algebraic transformations, this inequality can be written:

    N \ln k < \frac{2}{e}\, k^{\sqrt{N} - 1/2}                           (131)

When k = 2, this inequality is valid for N ≥ 27; when k = 4, it is valid
for N ≥ 4; when k ≥ 6, it holds for N ≥ 1.
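The break-even file sizes quoted above can be reproduced by direct comparison of the two optimum expected values. The sketch assumes that Equation (116) reads E = sqrt(N) + 1 for the single-level file and that Equation (122) reads E = log_k[eN/(2 log_k e)] + 3/2 for the indexed tree of headings; both forms are assumptions, as are the function names. It ignores the requirement that s and N_s be integers and applies only for k ≤ 7.

```python
import math


def e_single_level_opt(n_items: int) -> float:
    """ASSUMED form of Equation (116): optimum single-level file."""
    return math.sqrt(n_items) + 1


def e_indexed_headings_opt(n_items: int, k: int) -> float:
    """ASSUMED form of Equation (122); eN / (2 log_k e) = eN ln(k) / 2."""
    return math.log(math.e * n_items * math.log(k) / 2, k) + 1.5


def break_even(k: int, n_max: int = 10_000) -> int:
    """Smallest N0 such that the indexed tree wins for every N0 <= N <= n_max."""
    last_failure = 0
    for n in range(1, n_max + 1):
        if not e_indexed_headings_opt(n, k) < e_single_level_opt(n):
            last_failure = n
    return last_failure + 1


if __name__ == "__main__":
    for k in (2, 4, 6):
        print(k, break_even(k))
```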
The optimum cases of Equations (106A) and (85A) can be compared
by using Equations (116) and (119). Equation (106A) will yield a smaller
E when:

    \frac{3}{2} + \log_k\left[\frac{eN}{(k+1)\log_k e}\right]^{(k+1)/2} < \sqrt{N} + 1

that is, when

    N \ln k < \frac{k+1}{e}\, k^{(2\sqrt{N} - 1)/(k+1)}                   (132)

Equation (132) is generally valid for larger files. For example, a
simple calculation with k = 10 shows that it is valid for N roughly
greater than 115 and invalid for smaller N. Hence, the single level
subject heading file results in a smaller average number of items
searched in files with less than 115 items. This conclusion is shown
clearly in Figure 6.


Figure 6 depicts the average number of headings and items
examined for a wide range of file sizes. Only optimum values for s
are shown. The figure indicates the superiority of indexed trees over
non-indexed trees and of non-indexed trees over single-level subject
headings, except for small files as indicated by Equation (132). However,
the degree of superiority of the indexed trees is somewhat misleading.
Although it is true that the average number of headings and items examined
or searched for such trees is much smaller than for the other file
structures, this fact does not imply much faster response times. By
omitting consideration of the indexing function itself, the burden of
search has in a sense merely been shifted elsewhere. Unless the indexing
function is powerful, the search procedure in an indexed tree,
particularly where k is large, may spend almost as much time examining
indexes to determine the appropriate paths as would be involved in
examining the headings themselves.

FIGURE 6. Average Number of Headings and Items Examined in a Search
of Differently Organized Files

A singular feature of Figure 6 is that the indexed tree of
items, Equation (93A), and the indexed tree of headings, Equation (112A),
give similar values of E. The same is true for the non-indexed trees
represented by Equations (110A) and (106A). The explanation, however,
is simple. Equations (118) and (121) require that the number of subject
headings should be so large that essentially only a few items or even a
single item are filed sequentially under each node of the last row. In
other words, N_s is small. This fact can be seen from the values of N_s
derived by inserting Equations (118) and (121), respectively, into
Equation (104). These values are:

    N_s = (k+1)\log_k e                                                  (133A)

    N_s = 2\log_k e          (k \le 7)
                                                                         (133B)
    N_s = 1                  (k > 7)

Consequently, almost all the searching is performed in the tree of
headings where it is most economical. Hence, the close correspondence
arises between trees of headings and between trees of items. Of course,
in practice, it may frequently be impossible to achieve a meaningful
breakdown of related headings to such a detailed level. Therefore, the
optimum values of s, N_s, and E should be regarded as interesting
idealizations. In practice, only integral values of s and N_s can be
used.
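The size of the optimum item groups is easy to tabulate. The sketch below assumes that Equations (133A) and (133B) read N_s = (k + 1) log_k e and N_s = 2 log_k e respectively, uses the identity log_k e = 1/ln k, and clamps the second value at one item; the function names are hypothetical.

```python
import math


def items_per_node_nonindexed(k: int) -> float:
    """ASSUMED form of Equation (133A): N_s = (k + 1) log_k e."""
    return (k + 1) / math.log(k)


def items_per_node_indexed(k: int) -> float:
    """ASSUMED form of Equation (133B): N_s = 2 log_k e, but never below 1."""
    return max(1.0, 2 / math.log(k))


if __name__ == "__main__":
    for k in (2, 4, 7, 8, 10):
        print(k,
              round(items_per_node_nonindexed(k), 2),
              round(items_per_node_indexed(k), 2))
```

The clamp takes effect as soon as 2/ln k drops below 1, that is for k greater than e squared (about 7.39), which matches the division between k ≤ 7 and k > 7.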
In cases where the optimum curves plotted in Figure 6 are
unrealistic because they restrict s too much, the equations developed
in this and the previous section can be used to generate complete sets
of design charts. From these charts the best file organization can be
read in terms of whatever value s must have to reflect the logical
relationships and the nature of the subject matter to be classified.

In the interest of completeness, Figure 7 is included for
reference. It relates the number of levels of nodes in a regular tree of
order k to accommodate N items, one item per node. Figure 7 is obtained
from Equations (92) or (97).

4.4.1.5 Variance From the Expected Values - The utility of the
average or expected number of items and headings examined in different
file structures depends upon the likelihood that the number of items and
headings searched will generally be near the average value. An estimate
of this likelihood is provided by the statistical variance of the number
of items and headings searched from the average number. Expressions for
the variance relative to Equations (85A), (93)*, (106A), (110A), and
(112A) will be developed and analyzed.

Directly from the definition, the variance σ² of the single-level
subject heading file can be written:

    \sigma^2 = \frac{1}{s}\sum_{i=1}^{s}\left[i - \frac{s+1}{2}\right]^2
             + \frac{s}{N}\sum_{j=1}^{N/s}\left[j - \frac{N+s}{2s}\right]^2          (134)

Carrying out the summations yields:

    \sigma^2 = \frac{s^2 - 1}{12} + \frac{(N/s)^2 - 1}{12}               (135)

    \sigma^2_{(85A)} = \frac{1}{12}\left[s^2 + (N/s)^2 - 2\right]        (136)

[Note: the subscript such as (85A) references the equation related to
a given variance.]

* In this case Equation (93) will be used instead of Equation (93A).
Equation (93A) is not sufficiently accurate to be used in computing the
variances, because the variances are small. The computation is based upon
differences between numbers that are approximately equal.

By differentiating Equation (136) with respect to s, setting
the result equal to zero, and checking the appropriate requirements, it
can be shown that:

    s = \sqrt{N}                                                         (137)

gives the minimum variance. Thus the s that gives minimum E, Equations
(115) and (116), also gives the minimum variance. This value is:

    \sigma^2_{min} = \frac{N - 1}{6}                                     (138)
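The variance of the single-level file can be verified by brute force. The sketch below enumerates every (heading, item) position in a file of s headings with N/s items each, under the report's assumption that every position is equally likely, and compares the result with the closed form (1/12)[s² + (N/s)² - 2], which is the assumed reading of Equation (136). The function names are hypothetical, and s is assumed to divide N.

```python
from fractions import Fraction
from itertools import product


def enumerated_variance(s: int, items_per_heading: int) -> Fraction:
    """Variance of (headings examined + items examined) over all positions."""
    costs = [Fraction(i + j)
             for i, j in product(range(1, s + 1),
                                 range(1, items_per_heading + 1))]
    mean = sum(costs) / len(costs)
    return sum((c - mean) ** 2 for c in costs) / len(costs)


def closed_form_variance(s: int, n_items: int) -> Fraction:
    """ASSUMED reading of Equation (136)."""
    return (Fraction(s) ** 2 + Fraction(n_items, s) ** 2 - 2) / 12


if __name__ == "__main__":
    n = 100
    for s in (5, 10, 20):
        print(s, enumerated_variance(s, n // s), closed_form_variance(s, n))
```

For N = 100 the minimum occurs at s = 10, where both computations give 33/2, that is (N - 1)/6.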


For the indexed tree of items, the variance is:

    \sigma^2 = \frac{1}{N}\sum_{j=1}^{n} k^{j-1}\left[j - E_{(93)}\right]^2          (139)

where n is given by Equation (92). An elementary theorem of mathematical
statistics states that Equation (139) is equal to:

    \sigma^2 = \frac{1}{N}\sum_{j=1}^{n} j^2 k^{j-1} - E^2               (140)

where E is the expected value obtained from Equation (93). The sum in
Equation (140) can be evaluated by using some relationships among the
derivatives of arithmetic and geometric series. Generating functions
can also be employed directly and effectively, in this case, to obtain
the variance. Using either of these methods, the following expression
for the variance can be derived:

    \sigma^2_{(93)} = \frac{1}{k-1}\left[\frac{L_N^2}{N} - \frac{2L_N}{(k-1)N}
                      - 2L_N + \frac{k+1}{k-1}\right] + L_N^2 - E^2      (141)

where L_N is obtained from Equation (92) and E from Equation (93).
Equation (141) can be used to compute the variance for relatively small
size files (moderately large n).


As N becomes arbitrarily large, however, Equation (141) approaches
the following limiting value:

    \sigma^2_{(93)} = \frac{k}{(k-1)^2}                                  (142)

Equation (141) converges relatively rapidly to Equation (142). For
example, when k = 10, the following errors in the variance are
introduced:

    N          Error in Equation (142)
    10^3       1.11%
    10^4        .70%
    10^5       [illegible in source]

This point is primarily of academic interest, since the variances given
by Equations (141) and (142) are insignificant. For k ≥ 3, the variance
given by Equation (141) is less than 1. It can be shown that the variance
is a monotonically increasing function of N, and that Equation (142) is
an upper limit for the variance.
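The behavior of the variance for the indexed tree of items can be illustrated by exact enumeration. The sketch assumes a completely filled tree of order k with n levels, one item per node, every item equally likely, and takes the cost of finding an item to be the level on which it sits. The limiting value k/(k - 1)² is the assumed reading of Equation (142); the function names are hypothetical.

```python
from fractions import Fraction


def level_variance(k: int, n_levels: int) -> Fraction:
    """Exact variance of the level of a uniformly chosen node."""
    counts = [(j, k ** (j - 1)) for j in range(1, n_levels + 1)]
    total = sum(c for _, c in counts)
    mean = Fraction(sum(j * c for j, c in counts), total)
    second_moment = Fraction(sum(j * j * c for j, c in counts), total)
    return second_moment - mean ** 2


def limiting_variance(k: int) -> Fraction:
    """ASSUMED reading of Equation (142)."""
    return Fraction(k, (k - 1) ** 2)


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, float(level_variance(10, n)), float(limiting_variance(10)))
```

For k = 10 the enumerated variance rises with file size and stays below 10/81, and the limiting value is below 1 for every k ≥ 3.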


Applying similar methods, the variances for the other file
structures were derived. They are:

    \sigma^2_{(106A)} = \frac{(k+1)(k-1)}{12}(L_s - 1) + \frac{N_s^2 - 1}{12}        (143)

where L_s is obtained from Equation (97); N_s, from Equation (104).

    \sigma^2_{(110A)} = \frac{n^2 - 1}{12}                               (144)

where n is obtained from Equation (10?).

    \sigma^2_{(112A)} = \frac{N_s^2 - 1}{12}                             (145)

where N_s is obtained from Equation (104).


The variances of Equations (106A) and (112A) can now be derived
for optimum s. From Equations (97) and (118):

    L_s = \log_k\left[\frac{kN}{(k+1)\log_k e}\right]
        = 1 + \log_k\left[\frac{N}{(k+1)\log_k e}\right]                 (146)

Substituting Equations (133A) and (146) into Equation (143) yields:

    \sigma^2_{(106A)opt} = \frac{1}{12}\left\{(k^2 - 1)\log_k\left[\frac{N}{(k+1)\log_k e}\right]
                           + (k+1)^2(\log_k e)^2 - 1\right\}             (147)

In the case of Equation (112A), substituting Equation (133B) into Equation
(145) gives:

    \sigma^2_{(112A)opt} = \frac{4(\log_k e)^2 - 1}{12}                  (148)

Whenever the optimum N_s given by Equation (133B) is less than 1, N_s is
taken as 1 and the variance given by Equation (148) is zero. The reason
is, of course, that in this case there is a unique indexed procedure to
locate any item in a fixed number of steps.


The standard deviations from the expected values are shown in
Figure 8. In other words, Figure 8 is a graph of σ_(85A), σ_(93),
σ_(110A), and σ_(106A), obtained by taking the positive square root of
Equations (138), (141), (144), and (147), respectively. The graph was
plotted for k = 10. For this value of k, the standard deviation of the
indexed tree of headings with sequential items is zero for the reason
given after Equation (148). Consequently, this standard deviation has
not been included in the graph. As Figure 8 indicates, the standard
deviation of the indexed tree of items, Equation (141), is also negligible.
Hence, the expected value is a good indicator of the actual number of
headings and items examined in a single search of an indexed tree. The
standard deviation for the non-indexed tree of headings, Equation (147),
is somewhat larger; for the non-indexed tree of items, Equation (144), it
is still larger. For reasonably large files, the largest deviation is
that of the single level subject heading file, Equation (138).
Consequently, the expected number of headings and items examined is not a
good indicator of what will occur in any given search of a single-level
file. This point is verified by anyone's experience with this kind of
file.

FIGURE 8. Standard Deviation From Average Number of Headings and Items
Examined in a Search

Figure 9 compares the cumulative probability distributions for
three types of files. It indicates rather clearly the wide variation in
n among the file types (with a fixed file size) for any given probability
that the number of headings and items searched will be not greater than n
in any single search. For example, in a file of 111,111 items the
probability is .5 that fewer than 7 items will be examined in an indexed
tree; fewer than 25 in a non-indexed tree; but fewer than 335 in a
sequential, single-level heading file.

4.4.1.6 Generalized Expressions for Expected Values - The purpose
of this section is to present generalized expressions for the expected
number of headings and items searched, when two previous assumptions are
removed. These assumptions are:

(a) Each subject heading or item is equally likely to be the one
    sought.

(b) The same number of items is filed under each heading.

For example, if information is available on anticipated or past activity
of the file items, and if this information indicates the likelihood of a
given heading or item being requested, then the expected number of
headings and items searched can be obtained in terms of the available
data that approximate the probability distribution of file activity.
Generally, the more specialized the contents of a file, the better known
and more stable will be its activity. When the activity of the file is
known and it is relatively stable, it is clearly advantageous to organize
the file so that the items that have the greatest likelihood of being
requested are the most accessible. For obvious reasons such a file is
called activity organized. It is the intent of this section to provide a
general background for the investigation of activity organized files in
terms similar to those appearing in previous sections. For the sake of
simplicity, expressions for expected values will be presented for only
two of the file organizations. These expressions will provide a starting
point for the analysis of activity organized files. In each case, p(i)
indicates the probability that the i-th item or heading is the answer to
a request.

FIGURE 9. Cumulative Probability Distributions for a Search of
Differently Organized Files

The single-level subject headings with sequential items, Equation
(85), generalizes to:

    E = \sum_{i=1}^{s} i\,p_s(i)
      + \sum_{i=1}^{s}\left[\sum_{j=1}^{N_i} j\,p_i(j)\right]p_s(i)      (149)

where s      = the number of subject headings in the file,

      N_i    = the number of items under heading i,

      p_s(i) = the probability that the answer to a request is under
               heading i,

      p(j)   = the probability that item j is the answer to a request,

      p_i(j) = the probability that item j will be requested, given
               that it is filed under heading i.

This last probability is obtained from:

    p(j) = p_s(i) \cdot p_i(j)                                           (150)
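The generalized expression for the single-level file lends itself to a direct computation. The sketch below assumes that Equation (149) is the sum of the expected heading position and the activity-weighted expected item position within the chosen heading. The function name and the sample probabilities are hypothetical and serve only as illustration.

```python
from fractions import Fraction


def expected_single_level(heading_probs, conditional_item_probs):
    """ASSUMED reading of Equation (149).

    heading_probs[i-1]             = p_s(i), probability the answer is under heading i
    conditional_item_probs[i-1][j-1] = p_i(j), probability of item j given heading i
    Headings and items are examined sequentially in the order listed.
    """
    e_headings = sum(i * p for i, p in enumerate(heading_probs, start=1))
    e_items = sum(
        p_s * sum(j * p for j, p in enumerate(cond, start=1))
        for p_s, cond in zip(heading_probs, conditional_item_probs)
    )
    return e_headings + e_items


if __name__ == "__main__":
    uniform_items = [[Fraction(1, 5)] * 5] * 4
    uniform = expected_single_level([Fraction(1, 4)] * 4, uniform_items)
    skewed = expected_single_level(
        [Fraction(7, 10), Fraction(1, 10), Fraction(1, 10), Fraction(1, 10)],
        uniform_items,
    )
    print(uniform, skewed)
```

With four headings of five equally likely items each, the uniform case gives (s + 1)/2 + (N_s + 1)/2 = 11/2, while placing the most active heading first lowers the expected value, which is the motivation for an activity organized file.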

The expected value for the indexed tree of items, Equation (93),
generalizes to:

    E = \sum_{j=1}^{n} j\,p(j)                                           (151)

where p(j) is the probability of finding the answer on the j-th cut; it
is given by:

    p(j) = \sum_{i=1}^{k^{j-1}} p_j(i)                                   (152)

where p_j(i) is the probability that the i-th node on level j is the
requested item. Values for n are obtained from Equation (92).
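The generalized tree expression can be evaluated in the same way. The sketch assumes that Equations (151) and (152) read E = Σ j p(j) with p(j) the total probability of the nodes on level j; the nested-list representation of the tree and the function name are hypothetical.

```python
from fractions import Fraction


def expected_tree_level(level_probs):
    """ASSUMED reading of Equations (151)-(152).

    level_probs[j-1] lists p_j(i) for the nodes on level j
    (k**(j-1) nodes in a regular tree of order k).
    """
    return sum(j * sum(probs) for j, probs in enumerate(level_probs, start=1))


if __name__ == "__main__":
    seventh = Fraction(1, 7)
    uniform = [[seventh], [seventh] * 2, [seventh] * 4]          # k = 2, n = 3
    active_on_top = [[Fraction(4, 7)], [seventh] * 2, [Fraction(1, 28)] * 4]
    print(expected_tree_level(uniform), expected_tree_level(active_on_top))
```

For the uniform binary tree of seven items the result is 17/7; moving the most frequently requested item to the top node reduces it to 11/7.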

4.4.1.7 Summary - Conclusions have been developed and presented
throughout this section and will be summarized only briefly. These
conclusions are valid only for files where every heading and item is
equally likely to be required for a response.

(a) In terms of expected values, indexed trees give a lower average
    number of headings and items examined than non-indexed trees.
    Non-indexed trees give lower values than single-level subject
    headings, except for small files. The break-even points can be
    determined precisely from the equations in Section 4.4.1.4.

(b) Whenever a file of items can be indexed or ordered into a tree
    structure, it is disadvantageous, in terms of expected values,
    to superimpose any heading structure on the items.

(c) For trees and single-level subject heading files, there are
    relationships between the number of headings s and the number of
    items N in the file that minimize the expected number of headings
    and items that will be examined in a file search.

(d) The standard deviation from the average number of headings and
    items examined for indexed trees is small. Consequently, these
    average numbers are excellent indicators of the number of headings
    and items likely to be examined in a single search. The
    deviations for non-indexed trees are somewhat larger, so expected
    values have less utility. Finally, the deviation from the
    expected values of the file with single-level headings and
    sequential items is so large that the average values are poor
    indicators of the number of headings and items examined in any
    single search.

4.4.2 The Multi-List System

4.4.2.1 General - This section surveys and summarizes some basic
concepts of information storage and retrieval and their related
mathematical models. It is intended primarily to provide a comprehensive
review and evaluation of the Prywes and Gray Multi-List system, but
within the constraints of their report [62].

The need for a new approach to the solution of information
retrieval problems has led some investigators to abandon the addressable
memory in favor of an associative type of memory, in which information can
be retrieved on the basis of content rather than physical location or
address. However, it is possible to use an addressable memory in such a
way that information can be retrieved on the basis of its description by
simulating an associative memory. For instance, Newell, Shaw, and
Simon [52] simulated by programming a type of associative memory in which
lists of arbitrary length and organization could be generated by annexing
registers from a common store.

One major advantage of an associative store is that the allocation
of storage space for data is coordinated with the actual generation of the
data, thus achieving a sort of local optimization, since each basic item
of data occupies a minimal amount of space. A second advantage is that
data having multiple occurrences usually need not be stored in more than
one place, since there is an overlapping or intersection of lists. The
Multi-List system extends the associative memory-list storage concept:
each item of data appears only once in an addressable memory, and
descriptors and control information place the data item on a number of
separate lists. Although this technique requires a large amount of
storage, it has fast access and retrieval. The advantage of using an
addressable memory to simulate an associative memory is that this method
permits a versatility of requests and responses that is not attainable by
building the associative memory features into the hardware.

Much of the literature on file organizations and storage
allocation techniques indicates that chain allocations and tree structures
are among the best techniques available for efficient storage and
retrieval systems. The chained allocation is simply a type of list
processing technique in which each item is associated with the addresses
of other related items of the file. The tree structures often encompass
several types of allocation techniques; for example, combining random and
ordered allocation. A system that provides an efficient combination of the
tree structure and list storage techniques would appear to be a promising
solution to the information storage and retrieval problem; hence, the
investigation of the Prywes and Gray Multi-List system, which combines
these two techniques.

4.4.2.2 Description of the Multi-List System - The Multi-List
system described in the Prywes and Gray report has the following system
requirements:

(a) The use of an associative memory for storing, deleting, and
    reading of information without requiring addressing.

(b) A hierarchy of memories varying in speed and storage capacity.

(c) Processor organization and timing that are intended to minimize
    the time for instruction retrieval and housekeeping routines.

(d) Processor instructions that can process items of data of varying
    length.

(e) Built-in automatic retrieval of programs by name to allow for
    much greater vocabulary and ease of communication with the
    computer.

4.4.2.2.1 The Descriptive Structure - Information is stored
in the Multi-List system in the form of a set of items, each with an
associated set of descriptors. Each descriptor specifies a single
property of the item. A descriptor consists of an attribute and a value;
the attribute specifies a class of descriptors (e.g., color, account
number), and the value specifies the actual element of the class (e.g.,
chartreuse, 20178). Two descriptors are mutually exclusive if no single
item can be described by both of them; attributes are defined so as to
ensure that any two descriptors with the same associated attribute (e.g.,
color-chartreuse and color-green) are mutually exclusive. For the sake
of efficiency, attributes are organized into groups called superfields,
and the values associated with the attributes in the superfield are
combined to form a numerical key. Thus, keys bear the same relation to
superfields that values bear to attributes.

4.4.2.2.2 The Memory Structure - An addressable memory is used
to simulate the associative memory. This memory is divided into two
parts:

(a) The tree structure - The tree structure is used in order to
    provide access to all items having a given set of descriptors.
    In describing the tree, the terms branch and node are used in
    the usual sense. Each branch emanating from the top node
    represents a superfield. The lowest-level nodes under a
    superfield give the individual keys associated with that
    superfield, and each intermediate node represents the set of nodes
    below it. Thus, as one traverses the tree from bottom to top, one
    starts with an individual key and encounters successively larger
    sets of keys, each of which contains the preceding set. Since all
    keys are numerical, an appropriate arrangement makes it possible
    to label each node with an indication of the set of keys it
    represents. Consequently, it is easy to trace down the tree
    from top to bottom and locate the node at the bottom level
    corresponding to a particular given key.

(b) The multi-association area - In this area the file is contained
    in the form of lists. A list consists of a sequence of items.
    Each item contains the machine address of the next item on the
    list. A list emanates from each bottom node of the tree; the
    list contains precisely those items that are associated with
    the key corresponding to the node. An item can be contained in
    as many lists as there are superfields, though it may be
    contained in a smaller number of lists. Each item consists of a
    sequence of catenae; catenae are of several types. Two of
    these types are data catenae and associative catenae. Data
    catenae provide information not given by any of the descriptors
    represented in the tree. Associative catenae record a key and
    the next item on the list associated with that key. Thus, each
    item has as many list successors as it has keys (unless, of
    course, it is the last item on a key list).

A search down the tree structure is used to translate the
combination of descriptors given in a retrieval or change request into the
address of the first item on a list containing the items satisfying such
a description. This list, which originates at a bottom tree node, is
followed to retrieve or change the contents of the items. The list may,
however, contain extraneous items. One advantage of this type of storage
organization is the efficiency of retrieval, since a search is required
through only a small part of the total storage, while duplication of
items is still avoided. Other advantages include the ability to retrieve
by partial description and the ease of adding items and descriptors.
Deletion, however, is less economical.
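The memory structure and search procedure described above can be illustrated with a small sketch. It is a simplified model, not the Prywes and Gray design: a dictionary stands in for the tree that maps a key to the head of its list, integer addresses stand in for machine addresses, and each key is modeled as a (superfield, value) pair. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    data: str                                    # data catena
    links: dict = field(default_factory=dict)    # associative catenae: key -> next address


class MultiList:
    def __init__(self):
        self.memory = {}        # address -> Item (stand-in for addressable memory)
        self.heads = {}         # key -> address of first item (stand-in for the tree)
        self.next_address = 0

    def add(self, data, keys):
        """Store the item once and thread it onto one list per key."""
        address = self.next_address
        self.next_address += 1
        item = Item(data)
        for key in keys:
            item.links[key] = self.heads.get(key)   # old head becomes the successor
            self.heads[key] = address
        self.memory[address] = item
        return address

    def retrieve(self, keys):
        """Follow the list of the first key and drop extraneous items."""
        first, rest = keys[0], keys[1:]
        address = self.heads.get(first)
        found = []
        while address is not None:
            item = self.memory[address]
            if all(key in item.links for key in rest):
                found.append(item.data)
            address = item.links[first]
        return found


if __name__ == "__main__":
    file = MultiList()
    file.add("invoice-1", [("color", "green"), ("account", 20178)])
    file.add("invoice-2", [("color", "chartreuse"), ("account", 20178)])
    file.add("invoice-3", [("color", "green"), ("account", 555)])
    print(file.retrieve([("color", "green"), ("account", 20178)]))
```

Each item is stored only once, yet it is reachable through every key it carries. A request with several keys follows a single list and filters out the extraneous items, which is the behavior the text describes. The sketch follows the list of the first key given; a real system would choose which list to follow.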
Any available space can be used to store information in the
Multi-List system. The addresses of the available spaces are kept in a
List of Available Space (LAS); when an item is added or deleted, the LAS
is changed to record the appropriate modification. The information
structure of the Multi-List system is such that multiple paths to each
item are provided in the storage space; namely, through the trees for each
superfield associated with the item. The computer must be programmed to
choose the appropriate superfield when more than one is involved in a
retrieval request.
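The bookkeeping role of the List of Available Space can be shown with a minimal sketch. It assumes a fixed number of equally sized storage cells addressed by integers and keeps the free addresses on a stack; the class and method names are hypothetical and the actual LAS layout in the Prywes and Gray report may differ.

```python
class AvailableSpace:
    """Minimal free-list model of a List of Available Space (LAS)."""

    def __init__(self, size: int):
        # Stack of free addresses; address 0 is handed out first.
        self.free = list(range(size - 1, -1, -1))

    def allocate(self) -> int:
        """Remove an address from the LAS when an item is added."""
        if not self.free:
            raise MemoryError("no available space")
        return self.free.pop()

    def release(self, address: int) -> None:
        """Return an address to the LAS when an item is deleted."""
        self.free.append(address)


if __name__ == "__main__":
    las = AvailableSpace(4)
    a, b = las.allocate(), las.allocate()
    las.release(a)
    print(a, b, las.allocate())
```

Because any free cell is acceptable, a released address can be reused immediately by the next insertion.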

Several assumptions were made with respect to the organization
of the memory. First, it is assumed that a tree with the same number of
branches emanating from each node except at the lowest level (a balanced
tree) can always be constructed. A process for generating these trees
is described. Another assumption is that it is possible to divide the
totality of descriptors in an arbitrary information retrieval file into
attributes. In complex problems this separation of descriptors into
exclusive attribute groups may not be an easy task. A process for machine
analysis of the file to determine these groupings is also described.

4.4.2.2.3 Maximization of Efficiency - Different types of file
organizations are usually compared on the basis of operating time and
storage capacity to determine relative efficiency. These criteria are
not always the best measures to use, since it is often possible to improve
one at the expense of the other. One function that overcomes this type
of difficulty is the product of search time and storage capacity, which
can be considered as the cost of operating the system, since storage
capacity measures the amount of equipment required and search time
measures the time the equipment is in use. In the work on file organizations
by Hayes [31], maximum efficiency and minimum cost are achieved by minimizing
search time; a method for computing average search time is also described.
The Multi-List system employs a technique for the maximization of
efficiency based upon a minimum of the product of storage capacity and
retrieval time.
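
As a rough illustration of the product criterion, the sketch below compares two file organizations by the product of search time and storage capacity. The organization names and all figures are hypothetical, invented only to show the arithmetic; they are not taken from the report or from Hayes [31].

```python
# Hypothetical figures, invented for illustration only.
# Each entry: (average search time in ms, storage in thousands of words)
organizations = {
    "sequential": (900.0, 100.0),
    "tree_with_lists": (15.0, 130.0),
}

def operating_cost(search_time, storage):
    """Cost criterion from the text: search time multiplied by storage capacity."""
    return search_time * storage

costs = {name: operating_cost(t, s) for name, (t, s) in organizations.items()}
cheapest = min(costs, key=costs.get)
print(costs, cheapest)
```

Under this criterion an organization that uses somewhat more storage can still be preferred if it shortens the search enough to lower the product.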

The balanced tree is particularly well suited as a decoding
network for retrieval requests since the search time is almost equal for
all terminal tree nodes. The ability to have branches of the tree
associated with monotonically increasing numerical values makes the tree an
efficient tool for sorting an arbitrarily arranged ensemble of numbers.
The mutual exclusion of descriptors of an item can be used as a criterion
by which the computer can separate descriptors into distinct attribute
groups. The tree mechanism appears to be an efficient tool in the process
of establishing attribute groups whose members (descriptors) are mutually
exclusive.

A balanced tree is built by progressively adding more keys as
more data items are entered into the Multi-List memory. Keys of the
first data item to be filed form the nodes of the initial tree. Keys of
subsequent  data  items  are  incorporated  into  the  tree  structure  according 
to  various  rules. 

When a new item is to be added, the relationship of the new
item to all other items in the tree is determined. If any items having
the same keys as the current item were filed previously, then the lists
are located and the new item is incorporated in the corresponding lists
according to the established order (push-down fashion, alphabetical
order, etc.). When only part or none of the required lists exist, new
lists are added. To maintain the monotonic order of the key values, the
keys corresponding to the new lists are entered at specified locations
in the lowest level of the tree. If the tree item corresponding to this
location contains a vacant catena, then the key corresponding to the new
list and the address referring to the new item are inserted so as to
preserve the monotonic order of the keys. If no vacant catena is
available, a procedure that creates one is invoked. The depth of the tree is
increased whenever the required number of keys for an attribute increases
beyond a power of the number of nodes per level.
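
The depth rule can be illustrated with a small calculation. The sketch below assumes that "nodes per level" means a fixed number of branches b at every node, so that a tree of depth d can distinguish at most b^d keys; the function name and the choice b = 10 are hypothetical.

```python
def required_depth(num_keys, branches):
    """Smallest depth d such that branches**d >= num_keys (balanced tree)."""
    if num_keys < 1 or branches < 2:
        raise ValueError("need at least one key and two branches per node")
    depth, capacity = 1, branches
    while capacity < num_keys:
        depth += 1
        capacity *= branches
    return depth

# With 10 branches per node, the depth grows each time the key count
# passes a power of 10.
depths = [required_depth(k, 10) for k in (10, 11, 100, 101, 1000)]
print(depths)
```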

4.4.2.2.4 Automatic Stratification of Information - The use of
content-addressed memories alone is not sufficient to solve the retrieval
problem without additional stratification of the descriptor language. In
the Multi-List system the input data are semi-automatically processed into
attribute groups for input to the Multi-List trees; this process improves
efficiency in terms of speed, storage capacity, and versatility of
retrieval. The desired stratification of the descriptor language
consists of separating the entire vocabulary into attributes, each
consisting of a set of mutually exclusive descriptors.

In many problems, there exists a natural set of attributes.

This  is  true,  for  instance,  of  the  example  discussed  in  Section  ,2„7» 


More generally, it is necessary to discover such a set. The fewer the
attributes needed, the more efficient the system will be. The
lower bound on the number of attributes is the number of descriptors that
can simultaneously apply to an item; the upper bound is the total number
of possible descriptors, and this is usually quite large. The problem of
selecting exclusive attributes is somewhat analogous to the problem of
orthogonalizing a set of vectors via a linear transformation, where the
vectors may be of different dimensions. Each item corresponds to a vector
with as many components as the item has descriptors. The minimal number
of attributes required is analogous to the minimum dimension of the space
in which the vectors can be made orthogonal.

The  Multi-List  system  includes  an  algorithm  for  assigning 
descriptors  to  attributes  at  the  time  that  the  descriptors  show  up 
attached to input items. Thus, the attribute assignment program receives
inputs at successive moments of time; each input consists of a set of
descriptors, some of which are new and some of which have already been
assigned attributes. Each of these descriptors must then be associated
with a different attribute. Performing this assignment may be quite
complicated,  and  may  involve  the  creation  of  new  attributes.  The  system 
includes  provision  for  assistance  to  the  machine  at  this  task  from  a 
human  being.  The  algorithm  as  stated  appears  to  be  rather  inefficient 
in  terms  of  minimizing  the  number  of  attributes  needed. 
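
The report does not reproduce the assignment algorithm itself. The sketch below is therefore not the Multi-List algorithm but a simple greedy stand-in that shows the kind of bookkeeping involved: descriptors that arrive together on one item are forced into different attributes, earlier assignments are revised when a new item makes them invalid, and new attributes are created when needed. The class and method names are hypothetical.

```python
class AttributeAssigner:
    """Greedy, incremental grouping of descriptors into attributes so that
    descriptors seen together on one item never share an attribute."""

    def __init__(self):
        self.attribute_of = {}    # descriptor -> attribute number
        self.seen_with = {}       # descriptor -> descriptors it has co-occurred with
        self.num_attributes = 0

    def _fits(self, descriptor, attribute):
        return all(self.attribute_of.get(other) != attribute
                   for other in self.seen_with[descriptor])

    def add_item(self, descriptors):
        group = set(descriptors)
        for d in group:
            self.seen_with.setdefault(d, set()).update(group - {d})
        for d in sorted(group):
            current = self.attribute_of.get(d)
            if current is not None and self._fits(d, current):
                continue                      # existing assignment is still valid
            for candidate in range(self.num_attributes):
                if self._fits(d, candidate):
                    break
            else:                             # no existing attribute can hold d
                candidate = self.num_attributes
                self.num_attributes += 1
            self.attribute_of[d] = candidate


items = [{"red", "tall"}, {"blue", "short"}, {"red", "short"}, {"red", "blue"}]
assigner = AttributeAssigner()
for item in items:
    assigner.add_item(item)

lower_bound = max(len(item) for item in items)    # descriptors applying at once
upper_bound = len(set().union(*items))            # total number of descriptors
print(assigner.attribute_of, lower_bound, assigner.num_attributes, upper_bound)
```

In this small example the greedy rule ends with three attributes, between the lower bound of two and the upper bound of four stated in the text. A greedy rule of this kind gives no guarantee of reaching the minimum in general.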

4.4.2.2.5 The Memory Synchronizer - The list organization of
the memory permits the flow of data in and out of the memory on the basis
of content rather than location. This organization, which behaves like
an associative memory, would employ a variety of storage devices, e.g.,
core storage for fast access and limited capacity, drum or disc for
intermediate access and capacity, and tapes for slow access and large capacity.
In using a hierarchy of memories a coordinating or synchronizer unit is
required. The memory synchronizer is designed to be incorporated into
the hardware of the list machine memory. It has four basic instructions:
read item, store item, replace catena, and delete item. Its purpose is
to handle the memory space assignment of incoming or deleted data and to
synchronize the processor and the memory.

4.4.2.2.6 The Multi-List Processor - The design of a Multi-List
processor for this system is approached in two different ways. The
primary difference in the two approaches is that the second approach uses
an instruction memory for storing micro-instruction routines; the first
approach is based upon macro-instructions. In the first approach the
processor is developed from the basic operations of transfer and compare.
In the second approach, more complex processes were selected as the
basic processes (finding, filing, and deleting an item of information);
these processes require sets of micro-instructions to carry out each
function.


Both design approaches call for a hierarchy of memories: for
example, a parallel, read-only memory such as the UNIVAC search memory
for high speed operations, and a slower access memory for storing the
mass of data. These design approaches deal mainly with programming and
hardware to implement the Multi-List system and need not be described
in detail.


4.4.2.2.7 Sample Problem - Consider a personnel file of
approximately  10^  items.  The  file  contains  the  names  of  the  personnel 
and their descriptions in terms of a fixed set of attributes or categories
of information. The description will be made up of 15 exclusive attributes
where each attribute can have a fixed number of values. Ten values per
attribute are assumed. The 15 attributes used for this problem are as
follows:

The values of the attributes (or descriptors) will be represented by the
digits 0, 1, 2, ..., 9. The attributes are grouped into superfields, in
which each superfield represents 3 attributes; hence, there are five
superfields. This is done in order to represent these attributes
efficiently in a tree structure. Three descriptors per superfield will give
the values of the attributes; the combined values form a key. For each
key, the values range from 0 to 999.
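
The formation of keys can be sketched as follows. The sketch assumes that the three digit values of a superfield are combined by simple decimal concatenation, which is consistent with the stated range of 0 to 999 but is not spelled out in the text; the function name and the sample values are hypothetical.

```python
def superfield_keys(attribute_values, per_superfield=3):
    """Combine decimal attribute values (0-9) into one key per superfield."""
    if len(attribute_values) % per_superfield != 0:
        raise ValueError("attribute count must be a multiple of the superfield size")
    if any(not 0 <= v <= 9 for v in attribute_values):
        raise ValueError("each attribute value must be a digit 0-9")
    keys = []
    for start in range(0, len(attribute_values), per_superfield):
        key = 0
        for digit in attribute_values[start:start + per_superfield]:
            key = key * 10 + digit   # position within the superfield is significant
        keys.append(key)
    return keys

# 15 attribute values give five keys, each between 0 and 999.
example = [1, 1, 6, 0, 4, 2, 9, 9, 9, 0, 0, 0, 3, 5, 7]
keys = superfield_keys(example)
print(keys)
```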

An item of information such as a person's name and description
is represented in this system by two types of catenae: data catenae and
associative catenae. The data catenae contain the name; the associative
catenae contain the descriptors and addresses associated with them. The
attributes have positional significance in an item, as shown in Figure 10.

FIGURE 10. Relation of Data Catena and Associative Catenae

The tree structure for this example is shown in Figure 11. The
upper part of the diagram represents the tree structure, and the lower
part represents the multi-association area. Each point in the tree
represents a set of keys, some of which are explicitly indicated in the
diagram. The numbers in parentheses associated with each node (in either
area) represent hypothetical memory locations; these are used for
illustrative purposes in the sample problem. A trace can be made on any
value of any one of the 5 keys; the trace will lead to a node on the lowest
level of the tree. At this level the address of the head of a list
containing all items having that same key will be retrieved. The
intersection of the lists for each key contained in the item will yield the
appropriate item. Figure 12 illustrates the partial contents of the
Multi-List memory for the sample problem. The arrows indicate the path to
be taken if a search is made on the key for Superfield I (116) in order to
arrive at the appropriate item.

FIGURE 11. Multi-List Organization for a Personnel File
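
The retrieval step described above, tracing each key to the head of its list and intersecting the lists, can be sketched in a few lines. This is only a toy model: it uses Python dictionaries in place of the tree and the address-linked lists, and the class name, method names, and sample records are hypothetical.

```python
from collections import defaultdict

class MultiListFile:
    """Toy stand-in for the Multi-List memory: one list of item ids
    per (superfield index, key) pair."""

    def __init__(self):
        self.items = {}                  # item id -> (name, keys)
        self.lists = defaultdict(list)   # (superfield index, key) -> item ids

    def file_item(self, item_id, name, keys):
        self.items[item_id] = (name, tuple(keys))
        for superfield, key in enumerate(keys):
            self.lists[(superfield, key)].append(item_id)

    def retrieve(self, partial):
        """partial maps superfield index -> key; superfields that are not
        mentioned match anything (retrieval by partial description)."""
        if not partial:
            return sorted(self.items)
        candidate_lists = [self.lists.get((sf, key), [])
                           for sf, key in partial.items()]
        shortest = min(candidate_lists, key=len)   # follow the shortest list
        others = [set(lst) for lst in candidate_lists if lst is not shortest]
        return [i for i in shortest if all(i in other for other in others)]


personnel = MultiListFile()
personnel.file_item(1, "ADAMS", [116, 42, 999, 0, 357])
personnel.file_item(2, "BAKER", [116, 7, 999, 12, 400])
personnel.file_item(3, "CLARK", [230, 42, 15, 0, 357])

same_first_key = personnel.retrieve({0: 116})        # one list followed
both_keys = personnel.retrieve({0: 116, 1: 42})      # intersection of two lists
print(same_first_key, both_keys)
```

Following the shortest list first reflects the remark that the computer must choose an appropriate superfield when more than one is involved in a request.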

4.4.2.3 Summary and Evaluation - The Multi-List system for
information retrieval utilizes a conventional memory to simulate an
associative memory, thus gaining some of the advantages of each. It
employs a novel memory organization that incorporates both a conventional
tree structure and an unconventional list structure; the list structure
differs from most others in that a single element may actually appear
as part of several lists. This is accomplished by permitting an element
to have several distinct list successors. In the Multi-List system,
searching is extremely rapid and searching on at least some types of
partial description can be performed with no loss of time; if the
partial descriptions to be used can be anticipated in advance, then the
memory can be organized to handle them efficiently. There has been considerable
investigation of machine organizations and logic that can handle the
Multi-List system efficiently.


On paper, the system appears quite reasonable. However, it
cannot be operated on most conventional computers without significant
loss of efficiency. The problem of efficient deletions remains unsolved.
Difficulties arise, also, in organizing data into the attribute-value
descriptors  used  by  the  system.  It  is  necessary  to  structure  the  data 
so  that  the  number  of  attributes  will  not  be  unduly  large,  and  no  really 
general  way  of  doing  this  has  yet  been  found. 

In a paper on automatic stratification of information presented
at the 1963 SJCC [42] a hand-simulated example is given using natural
language (represented by a 2-digit code). This simulation has also been
programmed for the IBM 7090 using artificial input and a small amount of
ASTIA (or, presently, DDC) live data. This technique looks promising for
at least certain types of information retrieval problems once the
technique is fully developed. As is generally the case, the examples used
have a limited scope; a great deal of development is required before
the concept can be practicably implemented. The question remains as to
whether all types of information retrieval data will be adaptable to
descriptor/attribute stratification.

FIGURE 12. Example of Multi-List Memory Contents for Figure 11

4.5 QUERY PROCESSING

In an important sense the answers to all the preceding questions
determine in large measure the query capabilities of a system. Conversely,
the descriptor and processing structures must be designed to accommodate
query requirements. The state of the art in query capabilities of
operating information retrieval systems at the inception of this project was
limited to locating the documents that satisfy some level of Boolean
concatenation of descriptors. Most of the early work on the project
implicitly assumed essentially such a query capability. Since this
approach is well understood, it will not be considered further.

For a descriptor-oriented retrieval system in which documents have
probabilities attached to each descriptor, a mere matching procedure is
not suitable for query processing. One method of treating this
situation is to assign probability thresholds for the different descriptors;
the assignment will be dependent upon the nature of the query.

One  of  the  major  problems  in  generating  the  appropriate  response  to 
a  query  is  the  existence  of  redundancy  in  the  retrieved  data.  In  certain 
applications,  such  as  personnel  file  processing,  this  problem  poses  no 
appreciable difficulty. In literature retrieval or intelligence
analysis, the problem may become acute. Therefore, analysis of the redundancy
problem is a cogent necessity.

It was pointed out in Section 4.1.5.3 that query processing is a
non-trivial problem in dealing with intelligence data, but much less of
a problem in simpler situations. In the most difficult situation, the
system must be designed around the concept of a dialogue between the
system and the user. In addition, the full power of an implicit
information system may be necessary. In this section, both of these aspects
of query processing will be discussed.

4.5.1 Probabilistic Retrieval - The purpose of this section is to
present a method for deciding which documents should be retrieved in
response to a query, given that a description consists of a list of
non-exclusive category names, when documents are assigned to categories
probabilistically rather than absolutely. The decision algorithm will
be developed on the basis of maximizing a value function that measures
the goodness of the set of retrieved documents. Before proceeding
further, however, it will be helpful to examine some specific situations
in which probabilistic retrieval would be appropriate.

(a) The Case of Many Users - A situation may occur where the views
of users regarding membership of some documents in a certain
category are divergent. Assume, for example, that there are
100 users, 5 categories, and 10 documents. Each user is asked
to assign each document to one or more categories. Table 2
illustrates a possible set of choices. The numbers at the
intersection of rows and columns indicate the probability of
a document belonging to a certain category. Thus document
No. 10 will belong to category D with probability 1, since all
the users agree to place it there. On the other hand, the same
document will have a probability of zero of belonging to category
B; again, all the users agree to exclude it from this category.
Since 45 percent of the users agreed to place document No. 10
in category A, it has been assigned a probability of .45.

(b)  Automatic  Category  Formation  -  Documents  may  be  assigned  to 
categories  in  accordance  with  an  automatic  procedure.  This 
procedure  may  be  intrinsically  probabilistic  in  nature;  that 
is,  a  document  is  assigned  to  a  category  with  probability  p 
dependent  upon  the  circumstances  pertaining  to  the  assignment. 
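
For case (a), the probabilities are simply the fractions of users who place a document in a category. The sketch below reproduces the three values quoted in the text for document No. 10; the function name and the data layout are hypothetical.

```python
def assignment_probabilities(votes, num_users):
    """votes maps (document, category) -> number of users who placed the
    document in that category; returns the corresponding probabilities."""
    return {pair: count / num_users for pair, count in votes.items()}

# Document No. 10 from the example in the text (100 users).
votes = {(10, "A"): 45, (10, "B"): 0, (10, "D"): 100}
probs = assignment_probabilities(votes, num_users=100)
print(probs[(10, "A")], probs[(10, "B")], probs[(10, "D")])
```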


TABLE 2. PROBABILISTIC ASSIGNMENT OF DOCUMENTS TO CATEGORIES BY USERS

The specific response to a query will be determined through the use
of one or more cutoff points. For retrieval on a single category,
documents belonging to the category with a probability greater than or equal
to the cutoff point will be included in the response; all others will
be excluded. For queries specified as Boolean functions of categories,
multiple cutoff points will be needed, one for each category involved in
the query. The selection of cutoff points will be performed in such a
way as to maximize the goodness of the response. The following questions
must then be answered:

(a)  How  is  the  goodness  of  a  response  to  be  determined  quantitatively? 

(b) How is the cutoff point for a simple (i.e., one-category) query
to be determined?

(c) How are the cutoff points for a compound query to be determined?

These questions will be considered in the sequel.
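
Before the cutoff points are derived, it may help to see how they are applied. The sketch below retrieves on a single category and on a conjunction of categories, using one cutoff per category. The function names and the probability values are hypothetical and chosen only for illustration.

```python
def retrieve_single(probabilities, category, cutoff):
    """Documents whose probability of belonging to `category` is >= cutoff."""
    return sorted(doc for doc, cats in probabilities.items()
                  if cats.get(category, 0.0) >= cutoff)

def retrieve_conjunction(probabilities, cutoffs):
    """Documents meeting every per-category cutoff (a Boolean AND query)."""
    return sorted(doc for doc, cats in probabilities.items()
                  if all(cats.get(c, 0.0) >= s for c, s in cutoffs.items()))

# Hypothetical probabilities, invented for illustration.
probabilities = {
    1: {"A": 0.90, "B": 0.20},
    2: {"A": 0.45, "B": 0.70},
    3: {"A": 0.10, "B": 0.95},
}
single = retrieve_single(probabilities, "A", 0.4)
joint = retrieve_conjunction(probabilities, {"A": 0.4, "B": 0.5})
print(single, joint)
```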

4.5.1.1 The Problem of Establishing Criteria for Determining
User's Value of an Average Retrieval Procedure - With respect to any
retrieval request the entire collection of documents may be divided into
four subgroups:

(a) The retrieved documents that are relevant.

(b) The retrieved documents that are not relevant.

(c) The unretrieved documents that are relevant.

(d) The unretrieved documents that are not relevant.

Since it was assumed that the documents are assigned to categories on a
probabilistic basis, all four subgroups will generally be represented in
any retrieval process.

Regardless  of  any  special  assumptions,  it  is  clearly  permissible 
to  assert  that  as  the  number  of  documents  in  categories  (a)  and  (d) 
increases  and  as  the  number  of  documents  in  categories  (b)  and  (c) 
decreases,  the  value  of  the  retrieved  collection  to  the  user  will 
increase.  Thus, 

$$V = f_1\{I\} - f_2\{II\} - f_3\{III\} + f_4\{IV\} + K \tag{153}$$

where V is defined as the user value of the retrieved collection; $f_1$,
$f_2$, $f_3$, and $f_4$ are unspecified, monotonically increasing functions; and
{I}, {II}, {III}, and {IV} are the numbers of documents in the subclasses
(a), (b), (c), and (d), respectively. K is defined as a constant that
determines the minimal value for the user below which the retrieval is
not justified under any circumstances.

For simplicity, replace $f_1$, $f_2$, $f_3$, and $f_4$ by the constants
$\alpha$, $\beta$, $\gamma$, and $\delta$, and set K = 0. The results of this discussion are not
essentially modified by this simplification. Equation (153) then becomes:

$$V = \alpha\{I\} - \beta\{II\} - \gamma\{III\} + \delta\{IV\} \tag{154}$$

Since K = 0, the retrieval process should proceed as long as the increment
of V, dV, is positive. That is, the process may select a group of documents
with common probability characteristics (in relation to the request
profile) and then investigate the change of V by including some additional
documents with lower probability characteristics. The question as to
which documents will be retrieved is the problem of fixing the most
advantageous values for the set $\{\sigma_i\}$ of cutoff points for the descriptor classes.
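
The linear value function and the sign of its increment can be illustrated numerically. The sketch below assumes that a document with membership probability p counts, in expectation, as p of a relevant document and 1 - p of a non-relevant one; the weights and counts are hypothetical.

```python
def user_value(counts, alpha, beta, gamma, delta):
    """Equation (154): counts = (I, II, III, IV) document counts."""
    i, ii, iii, iv = counts
    return alpha * i - beta * ii - gamma * iii + delta * iv

def expected_increment(p, alpha, beta, gamma, delta):
    """Expected dV from moving one document with membership probability p
    from the unretrieved set to the retrieved set."""
    retrieved = alpha * p - beta * (1 - p)       # expected contribution if retrieved
    unretrieved = -gamma * p + delta * (1 - p)   # expected contribution if left out
    return retrieved - unretrieved

weights = dict(alpha=2.0, beta=1.0, gamma=1.0, delta=1.0)  # hypothetical weights
value = user_value((30, 10, 5, 55), **weights)
gain_high = expected_increment(0.9, **weights)
gain_low = expected_increment(0.1, **weights)
print(value, gain_high, gain_low)
```

With these weights, retrieving a document with p = 0.9 raises the expected value, while retrieving one with p = 0.1 lowers it, so retrieval should stop somewhere between the two.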

The appropriateness of replacing the functions $f_1$, $f_2$, $f_3$, and
$f_4$ by the constants $\alpha$, $\beta$, $\gamma$, and $\delta$ rests upon the understanding of what
factors could be responsible for the non-linearity of the function V.
Essentially there are two reasons why the function V should be non-linear.
The first pertains to the economics of using documents; the other, to
the problem of redundancy. In general, the efficiency with which the
retrieved collection is used depends upon its size, even if the value of
the individual documents in the collection is not prejudged. Nevertheless,
since retrieval systems can be used in various ways, it is safe to assume
that for many uses the relative emphasis placed upon the classes of
retrieved and unretrieved documents remains unchanged. To the extent
that this assumption is true, the fact that the function V depends upon
class {IV}, the class of correctly unretrieved documents, helps to remedy
the situation.

The second reason for non-linearity is more serious. Among the
retrieved documents there may be a high degree of redundancy; in some
cases the same amount of information may be entirely covered by a smaller
number of documents. It is difficult, however, to decide whether or not
redundancy is a linear function of the size of the retrieved collection.
To answer this question adequately, it would be necessary to formalize
the concept of redundancy among documents and then perhaps to formulate
theoretical prescriptions for procedures that would permit the system to
retrieve the most efficient covering of the topic specified in the request.
(This problem is a difficult task in itself and merits separate
investigation.) Pending a quantitative formulation of the theory of redundancy,
this discussion will be confined to the simplest assumption of linearity.
Therefore, given the function V in the form of Equation (154), the first
task is to find the set of cutoff points that will maximize the user's
value for an average retrieval process.


4.5.1.2 Determination of Cutoff Points for Simple Queries - We
start by introducing some notation. We assume that there are s categories,
denoted by the integers i = 1, 2, ..., s. To facilitate computation, the
number of documents in each class is assumed to be large enough and the
subdivision into the probability brackets fine enough to permit
integration techniques to replace summation. Let:

$N_i(p)$ = the number of documents in category i with
probability p or less.

$n_i(p) = \dfrac{dN_i(p)}{dp}$

$$P_i(\sigma) = \frac{1}{N}\int_0^{\sigma} n_i(p)\,p\,dp \tag{155}$$

$P_i = P_i(1)$

$f_i$ = the frequency with which category i is requested.

$\sigma_i$ = the cutoff point for category i.

N = the total number of documents in the collection.

If we assume that every document belongs to every category with at least
some non-zero probability, then we have:

$$N_i(0) = 0$$

and:

$$N_i(1) = N$$

We also assume that $n_i(p)$ is non-zero throughout the interval [0,1],
since its value can always be made sufficiently small to be statistically
insignificant.

The quantity $P_i(\sigma)$ represents the expected proportion of
incorrectly unretrieved documents when retrieval is performed with
cutoff point $\sigma$, that is:

$$P_i(\sigma) = \frac{\text{incorrectly unretrieved documents}}{\text{total documents in the collection}}$$

To follow this point, note that for 0 < p < $\sigma$, the expected number of
documents in the interval (p, p + dp) is $n_i(p)\,dp$, and that a proportion p
of these documents will actually belong to category i. Thus the number of
documents in the interval belonging to i will be $p\,n_i(p)\,dp$, and since
p < $\sigma$, none of these documents will be retrieved. Since these documents do in
fact belong to category i, they are incorrectly unretrieved. $P_i(\sigma)$ is
obtained by integrating $p\,n_i(p)\,dp$ over the interval from 0 to $\sigma$ and
dividing by N, thus covering all incorrectly unretrieved documents. Note
also that $P_i$ represents the expected proportion of documents in category i,
since with a retrieval threshold of certainty no documents will be
retrieved; hence all documents in category i will be incorrectly unretrieved.


We note also that from Equation (155):


(156) 


The procedure for calculating the set of $\sigma_i$'s that will maximize
V is:

(a) Calculate the numbers of documents for the four subclasses of
documents that enter V for an unspecified $\sigma_i$.

(b) Obtain a general expression for V for a single category.

(c) Obtain an expression for the expected value for all V's.

(d) Differentiate the expression obtained under (c), and set the
coefficients of the differentials equal to zero in order to
obtain a set of conditions for the maximum.

(e) Solve the equations to obtain the values of the $\sigma_i$'s.

We will permit different $\sigma_i$ for different categories.


We first calculate the number of documents in each subclass:

(a) Class I - The class of all correctly retrieved documents:

$$\{I\} = \int_{\sigma_i}^{1} p\,n_i(p)\,dp \tag{157}$$

(b) Class II - The class of all incorrectly retrieved documents:

$$\{II\} = \int_{\sigma_i}^{1} (1 - p)\,n_i(p)\,dp \tag{158}$$

(c) Class III - The class of all incorrectly unretrieved documents:

$$\{III\} = \int_{0}^{\sigma_i} p\,n_i(p)\,dp \tag{159}$$

(d) Class IV - The class of all correctly unretrieved documents:

$$\{IV\} = \int_{0}^{\sigma_i} (1 - p)\,n_i(p)\,dp \tag{160}$$

For a query on category i, then, we have:

$$V_i = \alpha\int_{\sigma_i}^{1} p\,n_i(p)\,dp - \beta\int_{\sigma_i}^{1} (1 - p)\,n_i(p)\,dp - \gamma\int_{0}^{\sigma_i} p\,n_i(p)\,dp + \delta\int_{0}^{\sigma_i} (1 - p)\,n_i(p)\,dp \tag{161}$$


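The expected value of V over all categories is obtained as a weighted
sum:

$$\bar{V} = \sum_{i=1}^{s} f_i V_i = \sum_{i=1}^{s} f_i\left[\alpha\int_{\sigma_i}^{1} p\,n_i(p)\,dp - \beta\int_{\sigma_i}^{1} (1 - p)\,n_i(p)\,dp - \gamma\int_{0}^{\sigma_i} p\,n_i(p)\,dp + \delta\int_{0}^{\sigma_i} (1 - p)\,n_i(p)\,dp\right] \tag{162}$$

The conditions for a maximum are obtained by setting the partial
derivatives with respect to each $\sigma_i$ to 0:

$$f_i\left[-\alpha\,\sigma_i\,n_i(\sigma_i) + \beta(1 - \sigma_i)\,n_i(\sigma_i) - \gamma\,\sigma_i\,n_i(\sigma_i) + \delta(1 - \sigma_i)\,n_i(\sigma_i)\right] = 0 \tag{163}$$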


Dividing by $f_i\,n_i(\sigma_i)$ yields:

$$-\alpha\sigma_i + \beta - \beta\sigma_i - \gamma\sigma_i + \delta - \delta\sigma_i = 0$$

$$\sigma_i(\alpha + \beta + \gamma + \delta) = \beta + \delta \tag{164}$$

so that:

$$\sigma_i = \frac{\beta + \delta}{\alpha + \beta + \gamma + \delta} \tag{165}$$

The quantity (165), then, is the optimal cutoff point for single-descriptor
queries. It is of interest to note that the cutoff point is the same for
all categories and, in fact, does not even depend on the probability
distribution of documents within the categories.
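
The independence of the cutoff point from the distribution can be checked numerically. The sketch below approximates the integrals of Equation (161) by a midpoint sum, searches a grid of candidate cutoffs, and compares the best one with Equation (165) for two different densities. The weights, the densities, and the grid sizes are hypothetical choices made for illustration.

```python
def expected_value(sigma, density, alpha, beta, gamma, delta, steps=1000):
    """Midpoint-rule approximation of V_i in Equation (161)."""
    dp = 1.0 / steps
    total = 0.0
    for k in range(steps):
        p = (k + 0.5) * dp
        n = density(p) * dp                  # expected documents in this bracket
        if p >= sigma:                       # retrieved: classes I and II
            total += (alpha * p - beta * (1 - p)) * n
        else:                                # unretrieved: classes III and IV
            total += (-gamma * p + delta * (1 - p)) * n
    return total

def best_cutoff(density, alpha, beta, gamma, delta, grid=200):
    """Candidate cutoff on a uniform grid that maximizes the approximate V_i."""
    candidates = [k / grid for k in range(grid + 1)]
    return max(candidates,
               key=lambda s: expected_value(s, density, alpha, beta, gamma, delta))

alpha, beta, gamma, delta = 2.0, 1.0, 1.0, 1.0          # hypothetical weights
predicted = (beta + delta) / (alpha + beta + gamma + delta)   # Equation (165)

uniform = best_cutoff(lambda p: 1.0, alpha, beta, gamma, delta)
skewed = best_cutoff(lambda p: 2.0 * p, alpha, beta, gamma, delta)
print(predicted, uniform, skewed)
```

Both densities lead to the same grid point, which agrees with the closed-form value.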


4.5.1.3 Determination of Cutoff Points for Compound Queries -
We now consider queries that are of the form $c_i \cdot c_j$; that is, we seek
documents that belong both to category i and to category j. In general,
the thresholds to be used on the individual categories will be different
for joint retrievals than for simple ones. We will initially assume that
the distributions of documents within categories are independent; that
is, that the membership of a document in category i does not affect the
probability of its membership in category j. We will also require that
a single cutoff point be established for each category given that the
query is of the form $c_i \cdot c_j$. As part of our independence assumption we
will assume that:

$$f_{ij} = f_i\,f_j \tag{166}$$

Thus the frequency of retrieval on a joint category is the product of
the frequencies on the individual categories. Under these assumptions,
we can carry out the analysis in the same way that we did for simple
queries.


4.5.1.3.1 Development of the Cutoff Point Equations - We let
$N_{ij}(p_i,p_j)$ denote the cumulative joint distribution function for categories
i and j; therefore, $N_{ij}(p_i,p_j)$ represents the number of documents that
belong both to category i with probability $p_i$ or less and to category j
with probability $p_j$ or less. We let $n_{ij}(p_i,p_j)$ represent the
corresponding density function, where:

$$n_{ij}(p_i,p_j) = \frac{\partial^2 N_{ij}(p_i,p_j)}{\partial p_i\,\partial p_j} \tag{167}$$

Similarly, we let $P_{ij}(p_i,p_j)$ denote the average probability of a
document belonging to both category i and category j, given that the document
belongs to category i with probability $p_i$ and category j with probability
$p_j$.

The assumption of independence of categories can be broken down
into two separate mathematical statements:

$$\frac{N_{ij}(p_i,p_j)}{N} = \frac{N_i(p_i)}{N}\cdot\frac{N_j(p_j)}{N} \tag{168}$$

and:

$$P_{ij}(p_i,p_j) = p_i\,p_j \tag{169}$$

These statements are independent in the sense that neither can be derived
from the other, and they represent two different aspects of independence
of categories. As a consequence of Equations (167) and (168), we obtain:

$$n_{ij}(p_i,p_j) = \frac{n_i(p_i)\,n_j(p_j)}{N} \tag{170}$$

for independent categories.


We can write expressions giving the number of documents in each
of the four classes involved in the value function $V_{ij}$.

(a) Class I - The class of all correctly retrieved documents:

$$ \{I\} = \int_{\sigma_i}^{1} \int_{\sigma_j}^{1} P_{ij}(p_i, p_j) \, n_{ij}(p_i, p_j) \, dp_j \, dp_i $$  (171)

(b) Class II - The class of all incorrectly retrieved documents:

$$ \{II\} = \int_{\sigma_i}^{1} \int_{\sigma_j}^{1} [1 - P_{ij}(p_i, p_j)] \, n_{ij}(p_i, p_j) \, dp_j \, dp_i $$  (172)

(c) Class III - The class of all incorrectly unretrieved documents:

$$ \{III\} = \int_{0}^{\sigma_i} \int_{0}^{\sigma_j} P_{ij}(p_i, p_j) \, n_{ij}(p_i, p_j) \, dp_j \, dp_i $$  (173)

(d) Class IV - The class of all correctly unretrieved documents:

$$ \{IV\} = \int_{0}^{\sigma_i} \int_{0}^{\sigma_j} [1 - P_{ij}(p_i, p_j)] \, n_{ij}(p_i, p_j) \, dp_j \, dp_i $$  (174)

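Since we are assuming independence of categories, we can simplify
Equations (171) through (174) by using Equations (169) and (170):

(a) $$ \{I\} = \int_{\sigma_i}^{1} \int_{\sigma_j}^{1} \frac{n_i(p_i) \, n_j(p_j)}{N} \, p_i p_j \, dp_j \, dp_i $$  (175)

(b) $$ \{II\} = \int_{\sigma_i}^{1} \int_{\sigma_j}^{1} \frac{n_i(p_i) \, n_j(p_j)}{N} \, (1 - p_i p_j) \, dp_j \, dp_i $$  (176)

(c) $$ \{III\} = \int_{0}^{\sigma_i} \int_{0}^{\sigma_j} \frac{n_i(p_i) \, n_j(p_j)}{N} \, p_i p_j \, dp_j \, dp_i $$  (177)

(d) $$ \{IV\} = \int_{0}^{\sigma_i} \int_{0}^{\sigma_j} \frac{n_i(p_i) \, n_j(p_j)}{N} \, (1 - p_i p_j) \, dp_j \, dp_i $$  (178)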

The  retrieval  process  proceeds  until  the  predetermined  cutoff 
point  for  descriptor  i  and  for  descriptor  j  has  been  reached.  To 
retrieve beyond this point will be detrimental, since on the average the
increment  in  V  caused  by  additional  retrieval  will  be  negative. 


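The four double integrals in Equations (175) through (178) can
now be evaluated. For Equation (175):

$$ \{I\} = \int_{\sigma_i}^{1} \int_{\sigma_j}^{1} \frac{n_i(p_i) \, n_j(p_j)}{N} \, p_i p_j \, dp_j \, dp_i
       = \frac{1}{N} \int_{\sigma_i}^{1} n_i(p_i) \, p_i \, dp_i \int_{\sigma_j}^{1} n_j(p_j) \, p_j \, dp_j
       = N [\bar{p}_i - \bar{p}_i(\sigma_i)] [\bar{p}_j - \bar{p}_j(\sigma_j)] $$  (179)

Similarly, Equations (176) through (178) become:

$$ \{II\} = \int_{\sigma_i}^{1} \int_{\sigma_j}^{1} \frac{n_i(p_i) \, n_j(p_j)}{N} \, (1 - p_i p_j) \, dp_j \, dp_i
        = \frac{1}{N} \left\{ [N - N_i(\sigma_i)] [N - N_j(\sigma_j)]
          - N^2 [\bar{p}_i - \bar{p}_i(\sigma_i)] [\bar{p}_j - \bar{p}_j(\sigma_j)] \right\} $$  (180)

$$ \{III\} = \int_{0}^{\sigma_i} \int_{0}^{\sigma_j} \frac{n_i(p_i) \, n_j(p_j)}{N} \, p_i p_j \, dp_j \, dp_i
         = N \, \bar{p}_i(\sigma_i) \, \bar{p}_j(\sigma_j) $$  (181)

$$ \{IV\} = \int_{0}^{\sigma_i} \int_{0}^{\sigma_j} \frac{n_i(p_i) \, n_j(p_j)}{N} \, (1 - p_i p_j) \, dp_j \, dp_i
        = \frac{1}{N} \left[ N_i(\sigma_i) \, N_j(\sigma_j) - N^2 \, \bar{p}_i(\sigma_i) \, \bar{p}_j(\sigma_j) \right] $$  (182)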
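As a numerical illustration of Equations (179) through (182), the sketch
below evaluates the four class counts from two histograms of document
counts. It assumes that $\bar{p}_i(\sigma)$ is $(1/N)$ times the integral
of $p \, n_i(p)$ from 0 to $\sigma$, which is the reading implied by
Equation (179); the function names, the midpoint discretization, and the
sample histograms are hypothetical and are not part of the report.

```python
def stats(counts, cut):
    """Return (N_i(sigma), pbar_i(sigma), pbar_i) for a histogram of
    document counts over [0, 1]; `cut` is the number of bins that lie
    below the cutoff point sigma. Bin midpoints stand in for p."""
    bins = len(counts)
    total = sum(counts)                              # N = N_i(1)
    mids = [(k + 0.5) / bins for k in range(bins)]
    n_below = sum(counts[:cut])
    p_below = sum(m * c for m, c in zip(mids[:cut], counts[:cut])) / total
    p_all = sum(m * c for m, c in zip(mids, counts)) / total
    return n_below, p_below, p_all


def class_counts(counts_i, counts_j, cut_i, cut_j):
    """Closed forms (179)-(182) for classes I..IV under independence."""
    n_total = sum(counts_i)
    if n_total != sum(counts_j):
        raise ValueError("both histograms must cover the same N documents")
    ni, pi_cut, pi_all = stats(counts_i, cut_i)
    nj, pj_cut, pj_all = stats(counts_j, cut_j)
    hit = n_total * (pi_all - pi_cut) * (pj_all - pj_cut)            # (179)
    false_hit = (n_total - ni) * (n_total - nj) / n_total - hit      # (180)
    miss = n_total * pi_cut * pj_cut                                 # (181)
    true_skip = ni * nj / n_total - miss                             # (182)
    return hit, false_hit, miss, true_skip
```

Classes I and II together always add up to $[N - N_i(\sigma_i)][N - N_j(\sigma_j)]/N$,
the expected number of documents retrieved, which gives a quick sanity
check on any implementation.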


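By substituting Equations (179) through (182) into Equation (154),
the function $V_{ij}$ for the value of a joint retrieval on categories i and
j becomes

$$ \begin{aligned}
V_{ij} = {} & \alpha N [\bar{p}_i - \bar{p}_i(\sigma_i)] [\bar{p}_j - \bar{p}_j(\sigma_j)] \\
 & - \frac{\beta}{N} \left\{ [N - N_i(\sigma_i)] [N - N_j(\sigma_j)]
   - N^2 [\bar{p}_i - \bar{p}_i(\sigma_i)] [\bar{p}_j - \bar{p}_j(\sigma_j)] \right\} \\
 & - \gamma N \, \bar{p}_i(\sigma_i) \, \bar{p}_j(\sigma_j) \\
 & + \frac{\delta}{N} \left[ N_i(\sigma_i) \, N_j(\sigma_j) - N^2 \, \bar{p}_i(\sigma_i) \, \bar{p}_j(\sigma_j) \right]
\end{aligned} $$  (183)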

By using Equation (183), it is possible to find the values of
$\sigma_i$ and $\sigma_j$ that will maximize a specific $V_{ij}$. In general,
however, the values $\sigma_i'$ and $\sigma_i''$ obtained by solving the maxima
in expressions $V_{ij}$ and, say, $V_{ik}$ will be different. Consequently we
need a set of values $\{\sigma_i\}$ that will maximize an average $V_{ij}$.

The average value of $V_{ij}$ is, of course, its expected value:

$$ E(V) = \sum_{i=1}^{s} \sum_{j=1}^{s} V_{ij} \, f_{ij} = \sum_{i=1}^{s} \sum_{j=1}^{s} V_{ij} \, f_i f_j $$  (184)

since $f_{ij} = f_i f_j$ by Equation (166), and this function will have to be
maximized. The differential of Equation (184) is:

$$ dE = \sum_{i=1}^{s} \sum_{j=1}^{s} f_i f_j \left[ \frac{\partial V_{ij}}{\partial \sigma_i} \, d\sigma_i
      + \frac{\partial V_{ij}}{\partial \sigma_j} \, d\sigma_j \right] $$  (185)

which implies the following condition for a maximum:

$$ \sum_{j=1}^{s} f_j \, \frac{\partial V_{ij}}{\partial \sigma_i} = 0 \qquad (i = 1, 2, \ldots, s) $$  (186)


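The partial derivatives in Equation (186) can be computed by using
Equations (179) through (182):

$$ \frac{\partial \{I\}}{\partial \sigma_i} = - \sigma_i \, n_i(\sigma_i) \, [\bar{p}_j - \bar{p}_j(\sigma_j)] $$  (187)

$$ \frac{\partial \{II\}}{\partial \sigma_i} = \frac{n_i(\sigma_i)}{N} \, [-N + N_j(\sigma_j)]
   + \sigma_i \, n_i(\sigma_i) \, [\bar{p}_j - \bar{p}_j(\sigma_j)] $$  (188)

$$ \frac{\partial \{III\}}{\partial \sigma_i} = \sigma_i \, n_i(\sigma_i) \, \bar{p}_j(\sigma_j) $$  (189)

$$ \frac{\partial \{IV\}}{\partial \sigma_i} = \frac{1}{N} \left[ N_j(\sigma_j) \, n_i(\sigma_i)
   - N \, \bar{p}_j(\sigma_j) \, \sigma_i \, n_i(\sigma_i) \right] $$  (190)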


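Performing the summations in Equation (186) on Equations (187) through
(190) results in:

$$ \sum_{j} f_j \, \frac{\partial \{I\}}{\partial \sigma_i}
   = - \sigma_i \, n_i(\sigma_i) \sum_{j} f_j \, [\bar{p}_j - \bar{p}_j(\sigma_j)] $$  (191)

$$ \sum_{j} f_j \, \frac{\partial \{II\}}{\partial \sigma_i}
   = \frac{n_i(\sigma_i)}{N} \sum_{j} [-N + N_j(\sigma_j)] \, f_j
   + \sigma_i \, n_i(\sigma_i) \sum_{j} [\bar{p}_j - \bar{p}_j(\sigma_j)] \, f_j $$  (192)

$$ \sum_{j} f_j \, \frac{\partial \{III\}}{\partial \sigma_i}
   = \sigma_i \, n_i(\sigma_i) \sum_{j=1}^{s} f_j \, \bar{p}_j(\sigma_j) $$  (193)

$$ \sum_{j} f_j \, \frac{\partial \{IV\}}{\partial \sigma_i}
   = \frac{n_i(\sigma_i)}{N} \sum_{j} f_j \, N_j(\sigma_j)
   - \sigma_i \, n_i(\sigma_i) \sum_{j} f_j \, \bar{p}_j(\sigma_j) $$  (194)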

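Therefore, the condition for a maximum is given by the equations:

$$ \begin{aligned}
 & - \alpha \, \sigma_i \, n_i(\sigma_i) \sum_{j=1}^{s} f_j \, [\bar{p}_j - \bar{p}_j(\sigma_j)]
   + \frac{\beta}{N} \, n_i(\sigma_i) \sum_{j=1}^{s} f_j \, [N - N_j(\sigma_j)] \\
 & - \beta \, \sigma_i \, n_i(\sigma_i) \sum_{j=1}^{s} f_j \, [\bar{p}_j - \bar{p}_j(\sigma_j)]
   - \gamma \, \sigma_i \, n_i(\sigma_i) \sum_{j=1}^{s} f_j \, \bar{p}_j(\sigma_j) \\
 & + \frac{\delta}{N} \, n_i(\sigma_i) \sum_{j=1}^{s} f_j \, N_j(\sigma_j)
   - \delta \, \sigma_i \, n_i(\sigma_i) \sum_{j=1}^{s} f_j \, \bar{p}_j(\sigma_j) = 0
\end{aligned} $$  (195)

for i = 1, 2, ..., s. It remains to show that a solution actually exists,
and to examine the properties of the solution.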


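4.5.1.3.2 Existence of Solutions to the Cutoff Point Equations -
In order to get some insight into the situation, set $\gamma = \delta = 0$;
i.e., assume that the function V depends only upon classes {I} and {II}.
In this case, Equation (195) is simplified to:

$$ \begin{aligned}
 & - \alpha \, \sigma_i \, n_i(\sigma_i) \sum_{j} f_j \, [\bar{p}_j - \bar{p}_j(\sigma_j)]
   + \frac{\beta}{N} \, n_i(\sigma_i) \sum_{j} f_j \, [N - N_j(\sigma_j)] \\
 & - \beta \, \sigma_i \, n_i(\sigma_i) \sum_{j} f_j \, [\bar{p}_j - \bar{p}_j(\sigma_j)] = 0
\end{aligned} $$  (196)

for i = 1, 2, ..., s. After rearranging and dividing by the common factor,
$n_i(\sigma_i)/N$, Equation (196) becomes:

$$ \sigma_i = \frac{\beta \sum_{j=1}^{s} f_j \, [N - N_j(\sigma_j)]}
                  {(\alpha + \beta) \, N \sum_{j=1}^{s} f_j \, [\bar{p}_j - \bar{p}_j(\sigma_j)]} $$  (197)

for i = 1, 2, ..., s.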


From Equation (197) it follows that if a solution exists at all,
then it is the same for all i, since the right side of this equation does
not depend on i. If we let:

$$ h(\sigma) = - \alpha \, \sigma N \sum_{j} f_j \, [\bar{p}_j - \bar{p}_j(\sigma)]
             + \beta \sum_{j} f_j \, [N - N_j(\sigma)]
             - \beta \, \sigma N \sum_{j} f_j \, [\bar{p}_j - \bar{p}_j(\sigma)] $$  (198)

then we can rewrite Equation (196) as:

$$ \frac{n_i(\sigma)}{N} \cdot h(\sigma) = 0 $$  (199)

We need to show that there exists a $\sigma$ such that $0 < \sigma < 1$ and
$h(\sigma) = 0$. Given this $\sigma$, then
$\sigma_1 = \sigma_2 = \ldots = \sigma_s = \sigma$ will be a non-trivial solution
of Equation (196). We demonstrate this fact by showing the following:

$$ h(0) > 0 $$  (200A)

$$ h(1) = 0 $$  (200B)

$$ h'(1) > 0 $$  (200C)

These three facts suffice: from Equations (200B) and (200C),
$h(\sigma) < 0$ for $\sigma = 1 - \epsilon$, where $\epsilon$ is positive and
sufficiently small, while $h(0) > 0$ by Equation (200A). The result then
follows from the Intermediate Value Theorem.

From Equation (198) we have:

$$ h(0) = \beta \sum_{j} f_j \, [N - N_j(0)] = \beta N > 0 $$

which demonstrates Equation (200A). Also, since $\bar{p}_j = \bar{p}_j(1)$ and
$N = N_j(1)$, clearly $h(1) = 0$, so Equation (200B) is true. Finally,

$$ h'(\sigma) = - \alpha N \sum_{j} f_j \left[ \bar{p}_j - \bar{p}_j(\sigma) - \frac{\sigma^2 n_j(\sigma)}{N} \right]
   + \beta \sum_{j} f_j \left[ - n_j(\sigma) - N \bar{p}_j + N \bar{p}_j(\sigma) + \sigma^2 n_j(\sigma) \right] $$

so:

$$ h'(1) = \alpha \sum_{j} f_j \, n_j(1) $$  (201)

Since $n_j(\sigma)$ has been assumed to be strictly positive in the unit
interval, it follows from Equation (201) that $h'(1) > 0$, so that
Equation (200C) holds. Hence a solution to Equation (196) does in fact
exist. It can similarly be shown that a solution to Equation (195) exists,
provided that $\delta$ is not too large. The details will not be given here.
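The existence argument suggests a direct numerical procedure: since
$h(0) > 0$ and $h$ is negative just below 1, bisection on the open unit
interval locates a root of Equation (198). The sketch below is a minimal
illustration under stated assumptions. The power-law category densities
$n_j(p) = N(a+1)p^a$, the helper names, and the tolerances are all
hypothetical choices made so that $N_j(\sigma)$ and $\bar{p}_j(\sigma)$ have
closed forms; the report does not prescribe a solution method.

```python
def make_power_category(total, a):
    """Hypothetical category with density n(p) = total*(a+1)*p**a on [0, 1]."""
    def count_below(sigma):
        # N_j(sigma): documents with probability sigma or less
        return total * sigma ** (a + 1)

    def mean_below(sigma):
        # pbar_j(sigma) = (1/N) * integral from 0 to sigma of p*n(p) dp
        return (a + 1) / (a + 2) * sigma ** (a + 2)

    return count_below, mean_below


def h(sigma, alpha, beta, total, freqs, cats):
    """Equation (198), i.e. the gamma = delta = 0 case."""
    g_n = sum(f * (total - nb(sigma)) for f, (nb, mb) in zip(freqs, cats))
    g_p = total * sum(f * (mb(1.0) - mb(sigma)) for f, (nb, mb) in zip(freqs, cats))
    return beta * g_n - (alpha + beta) * sigma * g_p


def solve_cutoff(alpha, beta, total, freqs, cats, tol=1e-10):
    """Bisection between 0 (where h > 0) and a point just below 1 (where h < 0)."""
    lo, hi = 0.0, 1.0 - 1e-6
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid, alpha, beta, total, freqs, cats) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The upper bracket stops slightly short of 1 because $h(1) = 0$ is the
trivial root that the argument excludes. Bisection returns one root in
the bracket; it does not by itself establish that the root is unique.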


4.5.1.3.3 Further Analysis of the Cutoff Point Equations for
$\gamma = \delta = 0$ - We now let:

$$ g_N(\sigma) = \sum_{j=1}^{s} f_j \, [N - N_j(\sigma)], \qquad
   g_{\bar{p}}(\sigma) = N \sum_{j=1}^{s} f_j \, [\bar{p}_j - \bar{p}_j(\sigma)] $$  (202)

Then Equation (197) becomes:

$$ \sigma = \frac{\beta \, g_N(\sigma)}{(\alpha + \beta) \, g_{\bar{p}}(\sigma)} $$  (203)

and this equation can be solved for $\sigma$, as we have shown.


Since $N(\sigma)$ is a monotonically increasing function of $\sigma$, it is
now possible to interpret the value of $\sigma$ established in Equation (203).

It is apparent that $g_N(\sigma)$ represents the average or expected number of
retrieved documents. On the other hand, each term of $g_{\bar{p}}(\sigma)$
represents a product of the average probability of retrieved documents times
the size of the descriptor group normalized by the frequency of usage of
this descriptor. Thus the $g_{\bar{p}}(\sigma)$ function expresses the average
number of retrieved documents properly belonging to the average descriptor
weighted by its frequency of occurrence. It is thus seen that the optimum
$\sigma$, expressed by Equation (203), is a function of the constants $\alpha$
and $\beta$, which express the relative importance attached to the correctly
and incorrectly retrieved documents; the optimum $\sigma$ is also a function
of two averages, namely $g_N(\sigma)$ and $g_{\bar{p}}(\sigma)$.

It is evident that the higher the value of $\beta$ (that is, the
importance attached to incorrectly retrieved documents), the higher will
be the value of $\sigma$. And as $\sigma$ increases, fewer documents will be
retrieved. On the other hand, the higher the value of $\alpha$ (that is, the
importance attached to the correctly retrieved documents), the lower will
be the value of $\sigma$. For lower values of $\sigma$ more documents will be
retrieved. The function $g_N(\sigma)$ decreases with the increment of value
of $\sigma$, and so does $g_{\bar{p}}(\sigma)$. When $\sigma = 0$:


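$$ g_N(0) = N \sum_{j} f_j = N, \qquad g_{\bar{p}}(0) = N \sum_{j} f_j \, \bar{p}_j $$  (204)

and when $\sigma = 1$:

$$ g_N(1) = g_{\bar{p}}(1) = 0 $$  (205)

Thus at $\sigma = 0$:

$$ \frac{\beta \, g_N(0)}{(\alpha + \beta) \, g_{\bar{p}}(0)}
   = \frac{\beta}{(\alpha + \beta) \sum_{j} f_j \, \bar{p}_j} > 0 $$  (206)

To evaluate the expression for $\sigma = 1$, L'Hopital's rule must be used
because of the indeterminacy of 0/0:

$$ \frac{g_N(\sigma)}{g_{\bar{p}}(\sigma)} \to \frac{g_N'(1)}{g_{\bar{p}}'(1)} \ \text{as} \ \sigma \to 1, \qquad
   g_N'(\sigma) = - \sum_{j} f_j \, n_j(\sigma), \qquad
   g_{\bar{p}}'(\sigma) = - \sigma \sum_{j} f_j \, n_j(\sigma) $$  (207)

Thus at $\sigma = 1$:

$$ \frac{\beta \, g_N'(1)}{(\alpha + \beta) \, g_{\bar{p}}'(1)} = \frac{\beta}{\alpha + \beta} < 1 $$  (208)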
From Equations (206) and (208) it follows that the optimum $\sigma$ never lies
at the extrema of the unit interval.


For  simple  queries,  it  follows  from  Equation  (l6£)  that  for 
$\gamma = \delta = 0$, the cutoff point is the same for all categories and is
given by:

$$ \sigma = \frac{\beta}{\alpha + \beta} $$  (209)


For joint retrievals, we have:

$$ \sigma = \frac{\beta}{\alpha + \beta} \cdot \frac{g_N(\sigma)}{g_{\bar{p}}(\sigma)} $$  (210)

Since

$$ g_N(\sigma) > g_{\bar{p}}(\sigma) \qquad (0 < \sigma < 1) $$  (211)

we see that:

(a) The cutoff point for joint retrieval on two categories is
always greater than the cutoff point for a single category.

(b) The cutoff point for joint retrieval does depend on the
probability distribution of documents within the categories.
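A worked special case makes both conclusions concrete. Suppose, purely
as an illustrative assumption, that every category has a uniform
density $n_j(p) = N$ on the unit interval. Then $g_N(\sigma) = N(1 - \sigma)$
and $g_{\bar{p}}(\sigma) = N(1 - \sigma^2)/2$, so Equation (203) reduces to
$\sigma^2 + \sigma - 2r = 0$ with $r = \beta/(\alpha + \beta)$. The function
names below are hypothetical.

```python
import math


def single_cutoff(alpha, beta):
    """Equation (209): cutoff for a simple query with gamma = delta = 0."""
    return beta / (alpha + beta)


def joint_cutoff_uniform(alpha, beta):
    """Positive root of sigma**2 + sigma - 2r = 0, the form Equation (203)
    takes when every category density is uniform (an assumed special case)."""
    r = beta / (alpha + beta)
    return (-1.0 + math.sqrt(1.0 + 8.0 * r)) / 2.0
```

With equal weights, the single-category cutoff is 0.5 while the joint
cutoff under this assumption is about 0.618; a different density would
give a different joint cutoff, which is the dependence noted in (b).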


4.5.1.4 Possible Generalizations - Generalizations to the
method of retrieval described here may proceed in either of two
directions. The first direction is to extend the method to handle Boolean
combinations of descriptors other than the conjunction of two descriptors;
the second generalization is to consider the more realistic situation
where the probability distributions of documents within different
categories are not independent.

The extension of probabilistic retrieval to the more general
Boolean functions appears to be a laborious but straightforward
mathematical task. It has not appeared worthwhile actually to carry out this
extension. However, on the basis of the results already presented it
would seem reasonable to expect that the cutoff point for a more
complicated retrieval would depend on the form of the retrieval and on the
ensemble of distributions, but not on the particular descriptors involved.


A  considerable  amount  of  effort  was  expended  in  attempting  to 
analyze  the  situation  for  the  case  of  dependent  categories.  Unfortunately, 
it  appears  that  this  problem  is  insoluble .  The  remainder  of  this  section 
will  discuss  the  reasons  for  this  conclusion. 


The case of dependent categories is a generalization of the case
of independent categories. One theoretically possible but impractical
solution would be to compute the joint distributions $N_{ij}(p_i, p_j)$ for
each (i, j) pair by actually counting the appropriate numbers of documents.

If values of $p_i$ and $p_j$ are computed in increments of $\delta$, then this
would require keeping $1/\delta^2$ separate statistics for each pair of
categories and $(s^2 - s)$ times that number for all possible pairs of s
distinct categories.
Similar statistics would be required for $P_{ij}(p_i, p_j)$. Therefore, one
would hope to find a single measure of relatedness between categories
and to use this measure in two different relationships: one that would
express $P_{ij}(p_i, p_j)$ in terms of $p_i$ and $p_j$ in a convenient
functional form; and the other that would express $N_{ij}(p_i, p_j)$ in terms
of $N_i(p_i)$ and $N_j(p_j)$. The assumption of independence led to the
relations in Equations (168) and (169), which accomplished this aim.

It is possible, and perhaps even reasonable, to assume that
$P_{ij}(p_i, p_j) = p_i p_j$ and to incorporate the effects of dependence
between categories into the distribution function $N_{ij}$ alone. The
rationale for this procedure is as follows: suppose that the distribution
statistics are based on the results of having documents assigned to
categories by a panel of users. If two categories are highly dependent
(for example, almost synonymous), then one would expect that those
documents that have a high probability of belonging to one category also
have a high probability of belonging to the other. A similar rationale
holds for documents that have a low probability of belonging to one or the
other of these categories. This effect would manifest itself as a skewness
in $N_{ij}(p_i, p_j)$. However, consider a single document that had been
assigned to category i by $p_i$ of the users and to category j by $p_j$ of
the users. Even if the categories are closely related in the sense that
documents belonging to one are likely to belong to the other, the
judgments of a particular panel member with respect to the two categories
may well be independent. For instance, suppose that two categories are
closely related, and a particular document is assigned to each of them
with a probability of 90 percent. It need not be true that the 90 percent
of the users who assigned the document to the first category are the same
90 percent as those who assigned it to the second category. It could
reasonably be assumed that the two groups are in fact selected
independently, so that only 81 percent of the users assign the document to
both categories. If we make this assumption, then we can take
$P_{ij}(p_i, p_j) = p_i p_j$. However, the problem of $N_{ij}$ remains.
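The 90 percent example can be placed in context with a short
calculation. The fraction of panel members who assign a document to
both categories cannot fall below $p_i + p_j - 1$ or rise above the
smaller of $p_i$ and $p_j$; the independence assumption picks the product
$p_i p_j$ from inside that range. The sketch below uses exact fractions;
the function name is hypothetical, and the two bounds are a standard
fact about overlapping groups rather than something stated in the
report.

```python
from fractions import Fraction


def joint_assignment_range(p_i, p_j):
    """Return (lowest possible, independent, highest possible) fraction of
    users who assign a document to both categories, given that p_i of the
    users assign it to category i and p_j assign it to category j."""
    lowest = max(Fraction(0), p_i + p_j - 1)   # the two groups overlap as little as possible
    independent = p_i * p_j                    # the report's assumption P_ij = p_i * p_j
    highest = min(p_i, p_j)                    # one group contains the other
    return lowest, independent, highest
```

For $p_i = p_j = 9/10$ this gives 80, 81, and 90 percent respectively,
so the 81 percent figure in the text sits close to the smallest overlap
that is possible at all.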


The type of relationship we are looking for should be of the
form:

$$ N_{ij}(p_i, p_j) = F(p_i, p_j, k_{ij}, k_{ji}) $$  (212)

where $k_{ij}$ is a parameter that measures the relatedness of category i to
category j. If we do not assume that $k_{ij} = k_{ji}$, then $k_{ij}$ would
measure the tendency of items in j to belong to i also, and conversely for
$k_{ji}$. That this situation can in fact arise is illustrated by the case
of nested categories: every document that belongs to the subcategory also
belongs to the larger category, but not conversely.

Let us consider the constraints on the expression for $N_{ij}$ as
given in Equation (212). Since $N_{ij}$ represents a distribution function,
we must have:

$$ n_{ij}(p_i, p_j, k_{ij}) \ge 0 $$  (213)

where $n_{ij}(p_i, p_j, k_{ij})$ is, as before, the joint probability density
defined by Equation (167), with $k_{ij}$ as a parameter. Since every
document belongs to each category with some probability between 0 and 1,
integrating the joint density over either variable must give the density
for the other category. We have:

$$ \int_{0}^{1} n_{ij}(p_i, p_j, k_{ij}) \, dp_i = n_j(p_j) $$  (214)

and similarly,

$$ \int_{0}^{1} n_{ij}(p_i, p_j, k_{ij}) \, dp_j = n_i(p_i) $$  (215)
If we define $k_{ij} = 0$ to be the case of independent categories, then we
must have:

$$ n_{ij}(p_i, p_j, 0) = \frac{n_i(p_i) \, n_j(p_j)}{N} $$  (216)

Finally, if we define $k_{ij} = 1$ to indicate synonymous categories, then we
will want

$$ n_{ij}(p_i, p_j, k_{ij}) \to \infty \ \text{as} \ k_{ij} \to 1 $$  (217)

The reasoning behind this equation is that for synonymous categories
the density function $n_{ij}$ will be zero for $p_i \ne p_j$, since every
document will be assigned to the two categories with the same probability.
Since $n_{ij}$ will be non-zero only along the line $p_i = p_j$ and since this
line has zero area, the density function on the line must be infinite if
the integral of the density is to be non-zero. This situation is, however,
approached only in the limit: hence we have Equation (217).

A careful examination of the forms that $n_{ij}$ might take has led
to the conclusion that there is no reasonably simple $n_{ij}$ that can be
found; and if $n_{ij}$ is too complicated, it will be impossible to carry
out the remainder of the analysis, which was difficult enough even in
the independent case. The two most likely forms were

$$ n_{ij}(p_i, p_j, k_{ij}) = \frac{n_i(p_i) \, n_j(p_j)}{N} \, f(p_i, p_j, k_{ij}) $$  (218)

and

$$ n_{ij}(p_i, p_j, k_{ij}) = \frac{n_i(p_i) \, n_j(p_j)}{N} + k_{ij} \, f(p_i, p_j, k_{ij}) $$  (219)

We will consider Equation (218) first.

For Equation (218) the constraints, Equations (213) through
(217), yield:

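$$ f(p_i, p_j, k_{ij}) \ge 0 $$  (220A)

$$ \frac{1}{N} \int_{0}^{1} n_i(p_i) \, n_j(p_j) \, f(p_i, p_j, k_{ij}) \, dp_i = n_j(p_j) $$  (220B)

$$ \frac{1}{N} \int_{0}^{1} n_i(p_i) \, n_j(p_j) \, f(p_i, p_j, k_{ij}) \, dp_j = n_i(p_i) $$  (220C)

$$ f(p_i, p_j, 0) = 1 $$  (220D)

$$ f(p_i, p_j, k_{ij}) \to \infty \ \text{as} \ k_{ij} \to 1 $$  (220E)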


It is apparent that f should be symmetric in $p_i$ and $p_j$. Furthermore,
from Equation (220B) we see that the quantity,

$$ \int_{0}^{1} n_i(p_i) \, f(p_i, p_j, k_{ij}) \, dp_i $$  (221)

must be invariant for all possible $n_i(p_i)$. Since $n_i(p_i)$ is an
arbitrary positive function of $p_i$, we must have:

$$ f(p_i, p_j, k_{ij}) = \frac{f^{*}(p_i, p_j, k_{ij})}{n_i(p_i)} $$  (222)

to cancel out the effect of varying $n_i(p_i)$. By symmetry, however, we
must also have:

$$ f(p_i, p_j, k_{ij}) = \frac{f^{*}(p_i, p_j, k_{ij})}{n_j(p_j)} $$  (223)

Since $f^{*}$ must be the same in Equations (222) and (223), we have a
contradiction and Equation (218) must be discarded.


If we try Equation (219), the constraint equations yield

$$ k_{ij} \, f(p_i, p_j, k_{ij}) \ge - \frac{n_i(p_i) \, n_j(p_j)}{N} $$  (224A)

$$ \int_{0}^{1} f(p_i, p_j, k_{ij}) \, dp_i = 0 $$  (224B)

$$ \int_{0}^{1} f(p_i, p_j, k_{ij}) \, dp_j = 0 $$  (224C)

$$ f(p_i, p_j, k_{ij}) \to \infty \ \text{as} \ k_{ij} \to 1 $$  (224D)

We have a similar difficulty. If f contains a multiplicative factor of
$n_i(p_i) \, n_j(p_j)$, then we can remove $n_i(p_i) \, n_j(p_j)$ from $n_{ij}$
and use the same argument as the one raised against Equation (218). Yet
without this factor there does not appear to be any way to satisfy
Equation (224B) in view of the arbitrary nature of $n_i(p_i)$ and $n_j(p_j)$.


What  we  have  shown  is  that  there  does  not  appear  to  be  any 
possibility  of  developing  an  analysis  of  probabilistic  retrieval  that 
will account for the relatedness of categories used in a query. However,
for most retrieval requests encountered in practice it would be reasonable
to  expect  that  different  categories  mentioned  in  the  request  would  be  at 
worst  slightly  related.  Furthermore,  a  well-chosen  set  of  categories 
will  probably  have  little  correlation  among  its  members  since  the  exist¬ 
ence  of  correlation  degrades  the  utility  of  the  categories.  In  summary, 
then,  the  use  of  the  independence  assumption  should  not  unduly  distort 
the  results  of  probabilistic  retrieval. 


4.5.1.5 Conclusions - It is now possible to outline the
general features of a probabilistic retrieval system. To each category
there will correspond a collection of classes of documents instead of a
unique class of documents. Each class will be determined by a different
cutoff point $\sigma$. For each document, there will be two types of cutoff
points, disjunctive and conjunctive. Within each of these categories an
individual $\sigma$ will have its value determined in accordance with the type
of joint retrieval it is scheduled to participate in. Thus there will be
one cutoff point for the conjunction of two descriptors, another for
conjunction of three, etc. The same principle holds for the cutoff points
for disjunctive retrievals. Any incoming request will be transformed
into convenient canonical form; for example, a disjunction of conjunctions.
The appropriate cutoff points will then be selected and retrieval effected.

In order to calculate the cutoff points, certain parameters are
required. These parameters can be obtained by requiring the system to
perform bookkeeping operations that will supply the required data.
Essentially, the kind of statistical data necessary for the calculation
of the cutoff points is:

(a) $n_i(p)$ = the density of documents pertaining to a given
descriptor for a given probability interval.

(b) $\bar{p}_i(\sigma)$ = the average probability value of a document
belonging to the descriptor i as a function of a cutoff point.

(c) $N_i(\sigma)$ = the total number of documents belonging to the
descriptor i as a function of $\sigma$.

The most fundamental of the three types of data is (a), since (b) and
(c) can be calculated from it.
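The remark that (b) and (c) follow from (a) can be sketched as a small
bookkeeping routine. The version below assumes that the system records,
for one descriptor, the probability with which each document belongs to
it, and that $\bar{p}_i(\sigma)$ is $(1/N)$ times the integral of
$p \, n_i(p)$ from 0 to $\sigma$, as implied by Equation (179). The
function names and the use of bin midpoints are hypothetical choices.

```python
from itertools import accumulate


def density_histogram(probabilities, bins):
    """(a) n_i(p): number of documents whose probability falls in each
    of `bins` equal-width intervals of [0, 1]."""
    counts = [0] * bins
    for p in probabilities:
        counts[min(int(p * bins), bins - 1)] += 1   # p == 1.0 goes in the last bin
    return counts


def cumulative_counts(counts):
    """(c) N_i(sigma), tabulated at the upper edge of each bin."""
    return list(accumulate(counts))


def cumulative_mean_probability(counts):
    """(b) pbar_i(sigma), tabulated at the upper edge of each bin, with
    each bin's midpoint standing in for the probabilities inside it."""
    bins, total = len(counts), sum(counts)
    weighted = [(k + 0.5) / bins * c for k, c in enumerate(counts)]
    return [w / total for w in accumulate(weighted)]
```

Only the histogram has to be maintained as documents are inserted; the
two cumulative tables can be rebuilt from it whenever cutoff points are
recomputed.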

4.5.2 The Problem of Redundancy

4.5.2.1 Introduction - Redundancy in the information retrieval
processes occurs whenever the retrieved data is duplicated. To avoid
redundancy is important, not only for the rather obvious economic reason,
but also for operational and logical reasons. Theoretical considerations
pertaining to the nature of measures for removing redundancy will be best
understood within the context of a more detailed discussion of the
undesirability of duplication from these three points of view.

4.5.2.2 Economic Point of View - For some types of information
retrieval systems the cost of retrieval may become prohibitively high,
especially if all the data pertaining to the request profile is retrieved.

The use value of the information contained in the retrieved
data may be drastically reduced by the existence of redundant material.
Effectively the user of the data is swamped by repetitious information.

4.5.2.3 Operational Point of View - Many information retrieval
systems enter into larger systems as component units. The retrieved data
may form an input to other processes such as control, command and control,
or real-time monitoring. The occurrence of redundant material may not
only reduce the efficiency of the functioning of the system, but also
affect the outcome of the processes to which the retrieved data forms an
input. For example, imagine a system that is required to perform some
statistical tabulations on the incidence of car accidents among various
population groups. Furthermore, assume that the reports on automobile
accidents are incoming from diverse sources so that some accidents may
be reported more than once. Under such conditions it will be necessary,
in order to obtain valid results, to introduce some filtering stage that
will prevent or eliminate duplication. Estimates of the reliability of
the results obtained will in general depend upon the effectiveness of the
filtering stage. The removal of data redundancy is thus vital to the
satisfactory performance of the system as a whole.
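The filtering stage in the accident example can be sketched as follows.
This is a minimal illustration only: it assumes that each report is a
record with hypothetical fields `date`, `location`, `vehicle`, and
`group`, and that two reports describe the same accident exactly when
the first three fields match. Real reports from diverse sources would
rarely agree so neatly, which is why the reliability of the tabulation
depends on how well the filter works.

```python
def tabulate_accidents(reports):
    """Count accidents per population group, counting each accident once
    even when several sources report it."""
    seen = set()     # identifying keys of accidents already counted
    tally = {}       # population group -> number of distinct accidents
    for report in reports:
        key = (report["date"], report["location"], report["vehicle"])
        if key in seen:
            continue                      # duplicate report: filter it out
        seen.add(key)
        tally[report["group"]] = tally.get(report["group"], 0) + 1
    return tally
```

Without the `seen` check, an accident reported by two sources would be
counted twice and would inflate the incidence for its population group.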

4.5.2.4 Logical Point of View - In the process of decision
making the origin of the data may be as relevant to the decision as its
content. It is even conceivable that the existence of a large amount of
redundancy in the collected data may be one of the important factors
influencing the nature of the decision. In other words, the decision
process may be dependent on the manner in which the data is presented.

As an example, imagine a system whose task it is to solve
transportation-routing problems. The kind of solution employed may well
depend upon the complexity of a particular problem. If the particular
transportation network contains many nodes, the system will use one type
of algorithm; if it contains few nodes, then another.

Determining the nature of the problem may depend upon sampling
of data; thus inaccuracies will arise if the data contains a large amount
of redundancy. Such a situation is particularly prone to arise if the
system schedules its own operations and batches many problems together.

4.5.2.5 Tentative Measures of Redundancy - Considering several
ways in which the concept of redundancy is implicated in the information
retrieval processes, a basic dichotomy becomes apparent:

(a) Some of the redundancy problems require the exact scrutiny of
the individual data items. If data items are conventionally
thought of as documents, then a sort of redundancy map could
be obtained by indicating the relationship with respect to
redundancy of each document to every other document in the
collection. The simplest kind of relation between documents with
respect to redundancy is that of inclusion; that is, one
document may express everything that another document expresses
with respect to a given topic. Another possible relation,
although a less simple one, is that of overlap. A document may
partially express the content of another document with respect
to a given topic with some numerical measure of the partial
covering.

(b) It may be possible or desirable to handle the problem of
reducing redundancy on an aggregate level. The distinguishing
feature of this approach is the statistical handling of
information contained in the documents. It is important to
remember that, since the primary concern is redundancy, the basic
measure of information must be relative rather than absolute.
That is, such a measure when applied to a document should be
able to determine the expected number of documents rendered
superfluous by the document in question; alternatively, the
measure should indicate how many documents render a given
document superfluous.

Usually a document will cover a number of topics. In general,
it must be expected that the redundancy measure will not be
evenly distributed among all the topics that a given document
deals with. Thus with respect to one topic a document may be
highly unique, whereas with respect to another, highly redundant.
Whether or not it is advisable to average the redundancy
measure over all topics or handle them separately is a question that
may be decided only after a more detailed and rigorous study.

It is also possible that this question admits no unique answer,
since information retrieval systems are highly differentiated
with respect to their functional characteristics.

It would be incorrect to assume that this dichotomy represents
two alternative approaches. It is quite unrealistic to expect that an
exhaustive redundancy map comprising the detailed breakdown of all
relations among all documents individually is feasible. Practically, some
sort of statistical approach is necessary. It is necessary, however, to
demand that any statistical averages employed to reduce redundancy capture
the true statistical properties of a system based upon the requirements
for a redundancy map.

4.5.2.6 Conclusion - It is important to avoid redundancy for
operational, logical, and economic reasons. Two tentative examples of
redundancy measures are:

(a)  Each  document  is  characterized  by  a  set  of  numbers  expressing 
the  percentage  of  documents  containing  more,  or  less,  informa¬ 
tion  concerning  a  given  topic. 

(b)  Each  document  is  characterized  by  a  set  of  numbers  expressing 
the  additional  contribution  that  the  document  would  make  to  the 
given  topic,  assuming  the  average  number  of  documents  already 
retrieved. 
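To make the inclusion and overlap relations and the second measure
concrete, the sketch below models what a document says about one topic
as a set of statement identifiers. That representation, and the function
names, are assumptions introduced for illustration; the report does not
specify how document content is to be encoded. The last function
simplifies measure (b) by taking the documents already retrieved as
given rather than averaging over them.

```python
def includes(doc_a, doc_b):
    """Inclusion: doc_a expresses everything doc_b expresses on the topic."""
    return doc_b <= doc_a


def overlap(doc_a, doc_b):
    """Overlap: fraction of doc_b's content on the topic that doc_a also
    expresses (1.0 when doc_b says nothing on the topic)."""
    if not doc_b:
        return 1.0
    return len(doc_a & doc_b) / len(doc_b)


def additional_contribution(doc, already_retrieved):
    """Number of statements in `doc` not covered by any document that has
    already been retrieved for the topic."""
    covered = set().union(*already_retrieved)
    return len(doc - covered)
```

A document with an additional contribution of zero is rendered
superfluous by the documents retrieved before it, which is the situation
the aggregate measures are meant to anticipate.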


4.5.3 Adaptation to User Requirements

User Orientation - The users of an information system
are  often  conceived  as  a  univocal  mass  that  knows  precisely  what  type  of 
information  it  wants  from  the  system.  The  problem  of  system  design  is 
then  reduced  to  the  simple  expedient  of  devising  means  of  access  to  the 
general  body  of  stored  information  for  this  class  of  users. 

In  fact,  however,  the  users  are  neither  univocal  nor  certain; 
if  they  were,  the  problem  of  information  retrieval  would  be  vastly 
simplified.  Any  intermediary  for  gaining  access  to  stored  information 
would  be  superfluous,  since  the  users  by  definition  have  a  priori  knowl¬ 
edge  about  the  nature  of  the  information  they  seek.  The  difficulty  is 
that  users  approach  any  information  system— even  a  library  card  catalogue — 
because  their  questions  are  vague  and  ill  formed.  Furthermore,  each  user 
wishes  to  fulfill  a  different  need. 

In confronting a new system, any user is wary at first; the
mechanism  of  the  system  stands  as  a  barrier  (and  possibly  a  threat)  between 
his  questions  and  whatever  answers  may  be  available.  The  first  criterion 
for  gaining  the  user's  confidence,  then,  is  simplicity;  the  mechanics  of 
the  system  should  be  readily  grasped  after  a  few  moments  of  study.  The 
second criterion is that the user quickly gain confidence that the system
can indeed produce reasonable responses to reasonably well formed
queries. 


This  second  factor  poses  the  greatest  difficulty.  If  a  user  has 
confidence  in  the  system,  he  is  willing  to  enter  a  tacit  dialogue.  A 
simple question, however ill formed, produces sufficient information to
lead to another, more cogent question. The dialogue continues from
question to answer to question until the user eventually frames precisely
the right question to gain access to the information he originally sought.
This process with the familiar card catalogue is heuristic; the same
process should occur with an automated system, but the interposition of
a machine may easily restrain the facility of the dialogue.

An  information  system  deals  with  the  functional  elements  of 
information in such a way that a sequence of operations upon these elements
or upon concatenations of these elements produces the requested
information. What is desired is information explicitly or implicitly
contained in the data received by the system. Thus, ultimately, logical
implications, generalizations, correlations, and even logical appraisals
of the original data (credulity measures and ordering relations) may be
the  results  of  these  operations. 

The requirements for performing operations upon the information
parallel, at least in part, those for storing information. These operations
should be defined so that information can be recombined into forms
that are not explicitly formed in the original information. Such processing
operations should be specified in relation to the storage operations.
The  retrieval  processes  may  then  gather  relevant  material  from  the  stored 
data  so  that  it  may  be  operated  upon  and  used  to  answer  questions.  Some 
of  these  operations  are  based  upon  statistical  analyses  of  the  data. 

Other operations are functions performed upon the question in order to
improve the formulation of a query. In this way the inherent difficulties
in establishing a dialogue between the user and the system may be reduced,
if not entirely eliminated.

Additional operations on information may be necessary. The
system may be expected to derive logical relationships existing among
data contained in its memory. In addition to logical inferences (deductions),
the system may be expected to perform inferential processes
(inductions). Such inductive inferences differ from deductive inferences
in two important respects: the relationships derived are not necessarily
valid; and not all the rules of inductive reasoning are explicitly
formalized.

Implied relationship is a generic term for all relationships
not explicitly contained in a system. Such relationships are derived by
means of inferential processes; that is, inductions and statistical
correlations. The term implied relationship includes relationships
derived on the basis of inductive, or non-rigorous, inferential
processes. Such relationships are by their nature not as well defined
as relationships obtained deductively. The system must, therefore, be
designed with the capacity to estimate the degree of credibility of such
derived relations and the degree of relevance to other information. On
the basis of such estimates the system may accept or reject the derived
conclusions.

Since the set of implied relationships is not well defined,
such a system will arbitrarily limit the range of derivable relationships.
It cannot be expected that the system will attempt to derive all the implied
relationships that lie within a specified range without being requested
to do so, either directly or indirectly, in terms of a question. On the
other hand, some of the implied relationships might be so important to
the functioning of the system that they ought to be derived even without
any initiating query. An information system would, therefore, be more
powerful if it possessed a set of decision algorithms for determining at
which point it must stop its inferential activities.

It is necessary to state the criteria employed to select the
relationships the system will derive. While the set of explicit relationships
stored in the memory of a system may be well defined, the corresponding
set of implicit relationships may not be. The derived implicit
relationships depend not only upon the set of explicit relationships, but
also upon the nature of the formal or informal inferential methods as well as
upon other factors (for example, the richness of association) less amenable
to precise description. Because of these factors it may be questioned
whether the notion of the set of all implicit relationships derivable from
the information is meaningful. From a practical viewpoint, some limitations
upon the range of implicit relationships must be imposed.

The  criteria  for  the  limitations  that  are  to  be  imposed  upon  a 
system's  ability  to  derive  implicit  relationships  ought  to  include: 

(a) Only implicit relationships possessing potential utility to
the users of the system should be derived.

(b) The system should not try to derive implicit relationships of
so complex a nature that the attempt is likely to end in failure.

(c) The limitations should be flexible enough to leave room for
learning.

The system may be able to increase the range of derivable implicit relationships
as it obtains more input information or elicits more information
about a question from the user; again the importance of a dialogue is
apparent. The criterion for the selection of derivable relationships,
which includes all three of these characteristics, is: the system is only
concerned with those implied relationships that can be derived in response
to a definite procedure specified by the user. This principle may be
considered as the organizing principle of the system.

There are several points that will clarify the meaning of this
principle. In addition, the adoption of this principle has certain
implications for the learning processes that will take place in an
information system. The phrase, "... in response to a definite procedure
specified by the user," does not mean that the user is obliged
to supply the directives that could be directly translated into programs
(that is, a sequence of actions resulting in an output consisting of the
appropriate implicit relationships). Neither does it mean that such a
specification need be supplied to the system initially.

The principle simply states that the user knows how to go about
solving the problem embodied in a query addressed to the system; he
knows how to solve the problem in terms of human mental processes.
Moreover, the principle does not require the user to state the procedure
formally. The concept of knowing how to go about solving problems implies
no more than that the user know enough about his own procedures to answer
questions about his approach to the problem.

4.5.3.2 A Concept of Questioning - In order to optimize the
retrieval ability of a system, the user should question the system within
the framework of a theory of questioning. The development of a concept
of questioning has occasioned considerable scientific interest within the
last decade. In part, such an interest is related to problems of retrieving
information, for even a cursory examination of questioning indicates
that it plays an important role in the retrieval of information. Every
pragmatically important question has a correct answer associated with it.
Such a correct answer is a statement that provides a person with information,
that is, knowledge that he did not possess at the time that he asked the
question. The statement may be true or false and still fulfill this
criterion. Given a framework of this kind, the concept of questions
requires a development along two parallel lines: the semeiology and
the methodology of questions.

The semeiology of questions pertains to the form and nature
of queries. Questions are a type of linguistic structure. Composed as
they are of signs (letters and words), questions have meaning. Such meaning
may be even more complex than the meaning of declarative statements,
since questions may also be logical functions of such meanings.

There  are  two  possible  ways  to  investigate  the  meaning  of  a 
question.  A  question  may  be  correlated  with  a  class  of  statements,  any 
one  of  which  is  a  correct  answer  to  the  question.  In  this  sense,  the 
question defines the scope of possible answers; it is neither responsive
nor meaningful to answer the question, "What time is it now?" with the
statement, "The Parthenon is located in Athens, Greece." On the other
hand,  there  are  questions  that  do  not  define  the  kind  of  statement  that 
is  a  correct  answer.  Consider  the  question,  "How  many  horns  does  a  unicorn 
have?"  "There  are  no  such  things  as  unicorns,"  is  as  correct  an  answer 
as,  "A  unicorn  has  one  horn."  In  other  words,  a  question  may  pragmatically 
admit  unclarity  about  the  boundaries  of  a  subject.  Only  procedurally 
correct  questions  request  information  within  a  framework  of  concepts  and 
statements  accepted  as  true  by  both  the  questioner  and  the  informer. 

The realization that a question is related to a given state of
knowledge requires further exploration. It is clear that a question is
meaningful only if the questioner refers to a set of interrelated concepts
either explicitly or implicitly. When a questioner asks, "What time is
it?" he knows that the answer is a set of numbers that have a certain
order (for example, "later than"). But it remains a problem whether some
concept must be assumed explicitly or implicitly for any question to be
meaningful.  It  may  be  that  in  order  for  a  question  to  be  meaningful, 
some  restriction  of  its  scope  must  be  present. 

The meaning of a complex term is not only determined by its
relationship to non-linguistic factors, but also by its logical relationship
to other terms. The meaning of questions is in part specified
by their logical or syntactical relationship to other questions. What
is required, then, is a formal logic of questions. Such a logic would
rigorously formulate:

(a) The syntax of a formal language into which questions in natural
language are translatable.

(b) The rules of deduction for such a language.

(c) The theorems concerning logical relations formulatable in such
a system.

It seems that the language in which the logic is formulated may be constructed
out of declarative sentences by the use of an undefined logical
operator [28, 29]. Logical functions analogous to deduction can then
be defined. In any system the correlation between questions and permissible
answers must be formally modeled by mapping a question on a set of
sentences. Semantically, at least, the range of variables should also
be specified for answers that are specifiable for standard types of
questions.

In addition to logical deducibility that would be studied by
such a calculus, there is another dimension of logical analysis. This
area pertains to the relative complexity of questions. It may be, for
example, that in a certain context a Why question is translatable into a
finite set of How questions. In this context, Why questions are more
complex than How questions. But there are many types of questions. In
addition, there are disjunctive and conjunctive questions as well as
general and particular questions. This brief discussion indicates that
a logical theory is necessary to consider problems of this kind
systematically.

Once a formal analysis of questions has been developed, it will
provide insight into the methodology of questions. If the questions that
imply other questions are known or are reducible to other questions, then
it is easier to develop strategies for sequencing questions so as to
obtain maximum information for a minimum set of questions. It is advantageous
for any information processing system to allow this condition to
be fulfilled.

Besides  purely  logical  and  formal  considerations,  there  is  a 
problem  of  methodology — the  strategy  or  heuristic  of  interrogation.  This 
problem  centers  on  the  problem  of  efficiency  and  purposefulness  in 
interrogation.  The  main  objective  is  to  relate  the  formal  characteristics 
of questioning to intentions that the questioner may have. From the nature
of the problem it is evident that, unlike the inquiry into formal properties
of questions, this discussion is mainly concerned with sequences
of  questions. 

There  are  two  types  of  goals  that  can  be  associated  with  the 
procedure  of  interrogation.  The  first  is  the  desire  to  obtain  more  factual 
information. A simple example of this type of interrogation is: "How
many people reside in Rome?" The second goal is to obtain a better
understanding of a certain area of inquiry. This objective may be
related to the interrogator's perception of gaps in the flow of information
or to his lack of understanding of the information. Efficient
and  intelligent  questioning  depends  upon  the  precision  with  which  the 
interrogator  can  pinpoint  the  kind  of  information  he  wants  as  well  as 
upon  his  ability  to  formulate  the  appropriate  sequences  of  questions. 

The  objective  of  this  concept  of  questioning  is  to  establish 
procedures for an interrogator to discern the intention of his interrogations.
The concept is not psychologically oriented. The problem is not to
correlate subjective states of mind with the objective elements of the
questioning process. The concept seeks to associate the properties of
sets of information with the rational formulation of interrogative
intentions. These intentions are then fulfilled if the sequence of
questions  is  appropriate  for  its  purpose. 

The  ordering  and  the  retrieval  of  information  depend  upon 
initially specified rules for information handling. These rules may
not  be  the  only  rules  for  data  handling  necessary  for  the  proper  and 
efficient  operation  of  an  information  system.  The  system  must  be  able 
to  acquire  new  rules  and  modify  old  rules  as  it  continues  to  process 
information.  The  acquisition  of  rules  may  be  divided  into  two  categories. 

One category includes processes based upon success-failure
criteria. In processes of this kind an information system attempts to
improve its performance without an interchange of complex questions with
the user. If the criteria for adequate performance are not satisfied,
the  system  seeks  to  improve  its  performance  solely  on  the  basis  of  its 
store  of  data  and  its  own  experience. 

The  second  category  includes  processes  based  upon  a  system's 
attempt  to  elicit  information  pertinent  to  the  formation  of  adequate 
processing  rules  from  the  user.  Such  processes  are  more  complex  than 
those  in  the  first  category.  In  addition  to  being  able  to  use  its  own 
experience, the system is able to question human beings and to use human
guidance. In this way the essential dialogue between a user and a system
may lead to the necessary well formed questions that will elicit the
required information for the user.

The  implication  of  this  discussion  is  that  the  user-system 
dialogue  will  necessarily  span  a  range  of  questions  over  a  period  of 
time,  however  short  the  time.  But  this  implied  constraint  need  not 
follow. A simple question may be simply answered; yet in a simple
question the necessary clues to the relevant information are almost
apparent. Consider a slightly more difficult instance. If the system
contains N categories of information, then N! question combinations are
possible. The information may also be stored so that a relation (A,B,C,...)
holds. The query may be framed (C,B,A). A simple response would state:
"If your request could also be (A,B,C), then your answer is..." This
approach appears too easy, but it is not uncommon. And if these functions
were automated, the domain of interrogation could be greatly
simplified.
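
The reordering response described above can be sketched in a few lines. This is only an illustration under stated assumptions: the stored relation, the answer text "document 17", and the names STORED, respond, and orderings are hypothetical and do not come from the project. The sketch tries each ordering of the query terms against the stored relations, which is practical only for small N because the number of orderings of N categories grows as N!.

```python
from itertools import permutations
from math import factorial

# Hypothetical stored relations: each key is the order in which the
# categories were filed; each value is the stored answer.
STORED = {
    ("A", "B", "C"): "document 17",
}


def respond(query, stored=STORED):
    """Answer a query, tolerating a different ordering of its categories."""
    query = tuple(query)
    if query in stored:
        return f"Your answer is {stored[query]}."
    # Try every reordering of the query terms against the stored relations.
    for candidate in permutations(query):
        if candidate in stored:
            shown = ",".join(candidate)
            return (f"If your request could also be ({shown}), "
                    f"then your answer is {stored[candidate]}.")
    return "No stored relation matches any ordering of your request."


def orderings(n):
    """Number of distinct orderings of n categories (n factorial)."""
    return factorial(n)
```

A query framed as ("C", "B", "A") is answered through the stored relation ("A", "B", "C"), while a query with no matching ordering is reported as unmatched rather than guessed at.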

4.5.3.3 The Linguistic Problem - Given an appropriate formal
representation of linguistic input, there still exist problems of equivocation
in word use that would disrupt the functioning of an inferential
processor. Consider the following true assertions:

(a) The number 2 is rational.

(b) Socrates is rational.

(c) Anything rational can reason.

These sample sentences have little inherent interest. Their purpose,
however, is paradigmatic rather than practical.
The word rational in sentence (a) is being used in a different
sense from the same word in sentences (b) and (c). Unfortunately, this
difference is more than a mere linguistic difficulty. It is conceivable
that at a purely linguistic level the equivocation is irrelevant. An
example is translation to another language that has the same ambiguity
in the use of the word rational. In the context of accurate inference,
however, this kind of apparently insignificant linguistic difficulty can
lead to serious logical problems. Thus, sentences (a) and (c) seem to
lead to the conclusion that the number 2 can reason. This falsehood is
directly attributable to the fallacy of the four-term syllogism produced
by the equivocation in the use of the word rational.*

* It is possible to argue that the difficulty lies not in the equivocation
in the use of "rational" but in the falsehood of sentence (c), given such
equivocation. Perhaps the example is ill chosen, but we would ordinarily
allow the use of generalizations such as (c) provided that the sense of
the words involved is clear. Thus, that anything that is heavy (or light)
has weight seems beyond question. The reason that there is no question
is that it is clear that the terms heavy and light are being used in the
sense of weight. We are not led to reject the generalization because
colors, for example, may be said to be heavy (awkward, but possible) or
light. We say rather that colors (as opposed to pigments) are not the
sorts of things that have weight and that the sense in which "heavy" or
"light" may be used to describe them is quite different from the sense in
which these words describe relative weight, even though there is a metaphoric
value in the analogy between weight and color dimensions.

For any deductive inference processor an awareness of such
equivocation is essential. Other research [1...] has led to a
sense value theory that may be able to discover such distinctions in
sense mechanically. For the purpose of inferential processing it would
be desirable to establish whether sense value theory may be applied to
the mechanical discovery of sense equivocations in practice. Appendix D
presents a discussion of the fundamental concepts of sense value theory
and an account of possible approaches to the application of sense value
theory to inferential processing.
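
The effect of the equivocation on a mechanical inference processor can be illustrated with a small sketch. It is not an implementation of sense value theory, whose details are given in Appendix D; it only assumes that some earlier step has attached a sense tag to each word. The tags "rational/math" and "rational/mind" and the function infer are hypothetical names chosen for the illustration.

```python
# Facts are (subject, predicate) pairs; a rule (P, Q) reads
# "anything with predicate P also has predicate Q".

def infer(facts, rules):
    """Forward-chain single-premise rules until no new fact appears."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for subject, predicate in list(derived):
            for premise, conclusion in rules:
                new_fact = (subject, conclusion)
                if predicate == premise and new_fact not in derived:
                    derived.add(new_fact)
                    changed = True
    return derived


# Untagged: "rational" is a single token, so the four-term fallacy goes through.
naive_facts = {("2", "rational"), ("Socrates", "rational")}
naive_rules = [("rational", "can reason")]

# Tagged (hypothetical sense tags): the two senses are distinct tokens,
# so sentence (c) no longer applies to the number 2.
tagged_facts = {("2", "rational/math"), ("Socrates", "rational/mind")}
tagged_rules = [("rational/mind", "can reason")]
```

With the untagged facts the processor derives that 2 can reason; with the tagged facts it derives the conclusion for Socrates only. The difficult part, assigning the tags mechanically, is exactly what the sketch leaves out.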

4.5.3.4 The Logical Problem - This section considers the development
of inferential capabilities, given a mass of initially linguistic
data reduced to an appropriate unequivocal form suitable for further
machine processing. Two kinds of inferential problems can be distinguished
at this point:

(a) The relatively straightforward problem of checking whether a
conclusion deductively follows from the information in the file.

(b) The more difficult problem of assessing the validity of a generalization
inductively.

While the problem of inductive inference will be left for later development
and will not receive much further consideration in this section,
it should be noted that most linguistic information files are probably
far too complex for simple deductive processing schemes to be effective
in regard to the answering of many kinds of questions.

Among the difficulties we may expect to encounter in implementing
automatic deductive processing, two are especially salient:

(a) A great deal of information that people use in developing valid
inferences about practical matters is never explicitly stated
in a textual account of the facts concerning some matter of
interest.

(b) Textual sources may contain contradictory assertions that render
successful deductive processing impossible because any conclusion
may follow.
The second difficulty may be regarded as an instance of the kind
of problem that only inductive systems that use probabilistic techniques
for weighting the significance of observations or assertions can overcome.
For the purpose of this discussion the second kind of difficulty is
regarded as one that automatic deductive systems should be able to detect
while leaving correction as a human function. The former difficulty,
however, will be a serious limitation on deductive systems. It seems
that it should be possible to work on this problem within a purely
deductive framework. That is, the problem does not inherently require
inductive techniques such as probabilistic weighting or generalization.
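
Detection of contradictory assertions, with correction left to a human, can be sketched as follows. The sketch assumes that the file has already been reduced to (statement, truth value) pairs, so it catches only direct contradictions and not those that appear after several inference steps. The function name find_contradictions and the sample assertions are hypothetical.

```python
def find_contradictions(assertions):
    """Return statements asserted both positively and negatively.

    Each assertion is a (statement, truth_value) pair. Nothing is
    repaired; the conflicts are only reported for human review.
    """
    seen = {}
    conflicts = []
    for statement, value in assertions:
        if statement in seen and seen[statement] != value:
            if statement not in conflicts:
                conflicts.append(statement)
        # Remember the first truth value recorded for each statement.
        seen.setdefault(statement, value)
    return conflicts


# Hypothetical file contents for illustration only.
file_assertions = [
    ("X was in Chicago on the given day", True),
    ("the shipment arrived", True),
    ("the shipment arrived", False),
]
```

Applied to the sample file, the function reports the shipment statement and leaves the choice between the two versions to the user.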

An example may help clarify the last conclusion. The human being
has no difficulty concluding from the fact that X was in Chicago all day
on a given day, that he was not in New York or Los Angeles or any other
different place on the given occasion. He is further generally able to
conclude that the individual in question was in Illinois rather than that
he was not in Illinois. Our hypothetical cogitator is able to perform
these feats of inference in essentially deductive fashion by appending
to the assertion about X being in Chicago, appropriate assertions about
naming conventions and spatio-temporal relations. Of course, the ordinary
person is able to derive these conclusions automatically without explicitly
stating the suppressed premises for the syllogisms leading to the appropriate
conclusions from the fact that X was in Chicago. We are, however,
ultimately assured of the validity of any argument, because it can be
reduced to a deduction from premises about which we do not entertain any
doubts. The models of the real world that the human being possesses allow
him to draw accurate conclusions because the models are accurate and
because the automatic conclusion-generating mechanisms he possesses are
in accord with explicit deductive reasoning. To the extent that these
conditions are not met, the human being's inference is bound to result
in error or, at best, be only fortuitously correct despite the invalidity
of the underlying argument or the falsity of the implicit premises.
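
The role of the suppressed premises in the Chicago example can be made concrete with a minimal sketch. The table CITY_IN_STATE and the two functions are hypothetical; they stand for the naming conventions and spatio-temporal assertions that a human reader supplies without stating them, and they cover only the three cities named here.

```python
# Hypothetical background premise: which state contains each city.
CITY_IN_STATE = {
    "Chicago": "Illinois",
    "New York": "New York",
    "Los Angeles": "California",
}


def was_in_city(stated_city, asked_city):
    """Premise: a person who spent the whole day in one city was in no other."""
    return stated_city == asked_city


def was_in_state(stated_city, asked_state):
    """Premise: being in a city entails being in the state that contains it."""
    return CITY_IN_STATE[stated_city] == asked_state
```

Given only the stated fact that X was in Chicago all day, the two explicit premises let a program conclude that X was not in New York and that X was in Illinois. Without them the file alone supports neither conclusion.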

The task for information systems technology is not to simulate
the inferential machinery the human being uses, but to reproduce its
results reliably when they correspond with valid arguments from acceptable
premises. To the extent that the simulation of human cognitive processes
furthers this end, it should be pursued for wholly technological reasons.
There have been several attempts to incorporate limited models of naming
conventions or spatial relations into systems of deductive inference for a
computer's answering of questions. Some examples of the former are Green's
Baseball Program [24] and Lindsay's Sad Sam Program [43]. The former is
able to deal with the logical relations implicit in the use of various
baseball terms; the latter is directed to the analysis of kinship relations
implicit in limited verbal statements about how one person is related
to another (for example, that X is a brother of Y automatically tells Sad
Sam that X is male and has a common ancestry with Y). Examples of inferential
systems for computers that use models of spatial relations include
Gelernter's geometry program [21] and Raphael's current research aimed at
developing a conversational computer that can answer questions about
assertions [66].*

* The last example also models non-spatial relations. Nor is the primary

5. CONCLUSIONS

This section presents some ad hoc conclusions pertaining to the
specific areas investigated during the course of this project. The
over-all conclusions are presented in Section 6.

These  conclusions  are  ad  hoc  because  they  represent  only  the  first 
stages  of  research  into  a  complex  problem.  The  results,  therefore,  are 
tentative.  Continued  research  could  lead  either  to  more  definite  results 
or  to  an  entirely  different  set  of  conclusions  based  upon  problems  that 
are  only  now  being  defined.  The  conclusions  are  organized  in  terms  of 
the basic questions discussed in the specification of retrieval systems.

5.1 DESCRIPTIVE STRUCTURE OF RETRIEVAL SYSTEMS

The  most  popular  form  of  description  in  existing  retrieval  systems 
is  the  descriptor  list.  Although  other  forms  of  description  have  been 
considered, they have not been developed to any significant degree of
effectiveness. The considerations presented regarding economy of
descriptions can serve as a basis for further development, but this
development remains to be implemented.

Given  that  the  descriptor  list  is  in  fact  used  as  the  mode  of 
description,  analytic  methods  can  be  helpful  in  selecting  the  particular 
set  of  descriptors  to  be  used.  These  methods  are  based  both  on  the 
logical structure of any given document collection and on the use of
that collection. Since dynamic retrieval systems change as the demands
on them change and as their contents shift, corrective methods must be
used to keep the descriptor set updated. The invariants that are
associated with relatedness can be providently used to keep the set
updated by constantly bringing the system classification scheme into
conformity with the users' classification scheme.

5.2  ASSIGNMENT  OF  DESCRIPTORS  TO  DOCUMENTS 

The rationale for assigning descriptors to documents automatically
(that is, with computational techniques) is that a greater degree of consistency
will be achieved. Human beings are subject to numerous vagaries
and inconsistencies, while a machine is invariant. Since automatic
techniques depend upon the information contained in a document, the
problem is to develop computational methods that will enable a machine
to categorize documents accurately on the basis of both the explicit and
the implicit information (or, more precisely, words) in those documents.

Two complementary techniques were analyzed during the course of this
project; those techniques were based on information theory and game
theory. The information theoretic formulation is a method for assessing
the individual validity of descriptors on the basis of clue words occurring
in documents. The game theoretic formulation provides a method
for selecting an optimal set of clue words.

The use of information theoretic techniques to select clue words
appears  to  be  a  promising  method  of  document  categorization.  From  the 
purely  heuristic  viewpoint  this  technique  seems  to  be  valuable  and  to 
represent  an  improvement  over  existing  techniques.  The  use  of  this 
technique  as  a  means  of  categorizing  documents  is  easily  mechanized. 
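
One way to see why such a technique is easily mechanized is to score each candidate clue word by how much its presence tells about the assignment of a descriptor. The sketch below uses the standard mutual information between two binary events as a stand-in; it is an assumption made for illustration and is not the formulation developed in this project. The function name and the document representation are likewise hypothetical.

```python
from math import log2


def mutual_information(docs, word, descriptor):
    """Mutual information (in bits) between word presence and descriptor use.

    docs is a list of (set_of_words, set_of_descriptors) pairs.
    """
    n = len(docs)
    # Count the four joint outcomes (word present?, descriptor assigned?).
    counts = {}
    for words, descriptors in docs:
        key = (word in words, descriptor in descriptors)
        counts[key] = counts.get(key, 0) + 1
    total = 0.0
    for (w, d), c in counts.items():
        p_wd = c / n
        p_w = sum(v for (w2, _), v in counts.items() if w2 == w) / n
        p_d = sum(v for (_, d2), v in counts.items() if d2 == d) / n
        total += p_wd * log2(p_wd / (p_w * p_d))
    return total
```

A word that occurs in exactly the documents carrying a descriptor scores high, while a word spread evenly across all documents scores zero and would be rejected as a clue word. The score treats each word separately, so it ignores dependence among clue words.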

To the extent that the occurrences of clue words are relatively
independent of each other, this computationally simpler approach should
suffice for selecting clue words and is an attractive solution
to the problem. However, the over-all reliability of this technique
remains in doubt because it is not at all certain that clue words per se
convey both the necessary and sufficient information for correct categorization
and because the methods for selecting the best clue words are not
ideal. Ultimately, the validity of this technique, particularly in comparison
with existing methods, warrants empirical verification.

The game theoretic approach to selecting clue words is theoretically
more appealing but more difficult to execute in practice. In theory
this technique will in fact select the best possible set of clue words.
But in practice it is still impossible to develop sufficient statistics
to predict the best possible set. As yet no good techniques for approximating
these statistics have been developed, but further research along
these lines should be undertaken.

5.3  FILE STRUCTURE

The quantitative results obtained in the analysis of certain basic
types of file structures demonstrate the value of trees and lists in
information retrieval systems. These results must be tempered by a
consideration of the time required for indexing operations in
list-oriented file structures; in particular, for small files the standard
linear methods appear to be the best because of the bookkeeping costs
associated with lists. The standard deviation of the search times
required for indexed trees is small, so that search times for this type
of structure can be reliably predicted. Linear forms of storage, on the
other hand, tend to have high standard deviations and highly variable
search times.
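
As a rough illustration of this contrast (not the analysis carried out in
this project), the sketch below counts the probes needed to locate each
record in a linear file and in a sorted file searched by bisection, which
is assumed here as a stand-in for an indexed tree. It ignores the
bookkeeping costs noted above, and the function names are hypothetical.

```python
import statistics


def linear_probes(n):
    """Probes needed to find each of n records by scanning from the front."""
    return list(range(1, n + 1))


def tree_probes(n):
    """Probes needed to find each of n sorted records by bisection
    (an assumed stand-in for a balanced index tree)."""
    probes = []
    for target in range(n):
        lo, hi, count = 0, n - 1, 0
        while lo <= hi:
            mid = (lo + hi) // 2
            count += 1
            if mid == target:
                break
            if mid < target:
                lo = mid + 1
            else:
                hi = mid - 1
        probes.append(count)
    return probes


def summarize(probes):
    """Return (mean, population standard deviation) of the probe counts."""
    return statistics.mean(probes), statistics.pstdev(probes)


for n in (8, 1024):
    lin_mean, lin_sd = summarize(linear_probes(n))
    tree_mean, tree_sd = summarize(tree_probes(n))
    print(f"n={n}: linear mean={lin_mean:.1f} sd={lin_sd:.1f}; "
          f"tree mean={tree_mean:.1f} sd={tree_sd:.1f}")
```

For a large file the spread of probe counts under bisection stays within a
few probes, while for the linear scan it grows in proportion to the file
size.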

The Multi-List structure cannot be directly compared with the basic
types of file structures because it is based upon retrieval on more than
one criterion at a time. The Multi-List technique appears to be an
effective way of performing retrieval of the kinds for which it was
designed; however, although adding items to the file or altering items
is fairly easy, deleting items is a complicated process. The value of
the Multi-List system probably cannot be suitably appraised until the
system is used in a practical application, since its approach is
sufficiently distinctive to make it difficult to compare Multi-List
analytically against other methods.
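
The asymmetry between adding and deleting can be seen in a toy sketch of a
multi-list file. The layout assumed here (one threaded list per descriptor,
with an index holding each list's head address and length) and all names are
illustrative assumptions, not a description of the actual Multi-List system.

```python
class MultiList:
    """Toy multi-list file: one threaded list per descriptor (hypothetical layout)."""

    def __init__(self):
        self.records = {}    # address -> {"keys": set, "next": {key: address or None}}
        self.index = {}      # key -> [head address, list length]
        self._next_addr = 0

    def add(self, keys):
        """Insert a record at the head of every list it belongs to."""
        addr = self._next_addr
        self._next_addr += 1
        links = {}
        for key in keys:
            head, length = self.index.get(key, (None, 0))
            links[key] = head
            self.index[key] = [addr, length + 1]
        self.records[addr] = {"keys": set(keys), "next": links}
        return addr

    def delete(self, addr):
        """Unlink a record from each of its lists; each list may have to be
        walked from its head to find the predecessor."""
        record = self.records.pop(addr)
        for key in record["keys"]:
            head, length = self.index[key]
            successor = record["next"][key]
            if head == addr:
                head = successor
            else:
                prev = head
                while self.records[prev]["next"][key] != addr:
                    prev = self.records[prev]["next"][key]
                self.records[prev]["next"][key] = successor
            if length == 1:
                del self.index[key]
            else:
                self.index[key] = [head, length - 1]

    def query(self, keys):
        """Return addresses of records carrying all the given keys,
        walking only the shortest of the relevant lists."""
        keys = list(keys)
        if not keys or any(key not in self.index for key in keys):
            return []
        shortest = min(keys, key=lambda k: self.index[k][1])
        hits, addr = [], self.index[shortest][0]
        while addr is not None:
            record = self.records[addr]
            if record["keys"].issuperset(keys):
                hits.append(addr)
            addr = record["next"][shortest]
        return hits


f = MultiList()
first = f.add({"radar", "antenna"})
f.add({"radar", "receiver"})
f.add({"radar", "antenna", "noise"})
print(sorted(f.query({"radar", "antenna"})))
f.delete(first)
print(f.query({"radar", "antenna"}))
```

Adding touches only the list heads, whereas deleting must repair a link in
every list on which the record is threaded.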

5.4  QUERY PROCESSING

The type of query processing appropriate to a given information
retrieval task is highly dependent on the nature of the task. For
personnel files, for instance, the problem is virtually trivial. For
literature retrieval, the problem becomes more difficult and techniques
such as probabilistic retrieval become useful. For intelligence data, quite
sophisticated search and inference strategies become necessary. In both
literature and intelligence information, it is important to bear in mind
the amorphous nature of the user's question as contrasted with his query.

Probabilistic retrieval should be a useful method for increasing the
effectiveness of literature retrieval through the use of additional
information, namely, the probability that a given categorization of a
document is correct. The distributional statistics needed for compound
retrievals require a significant amount of bookkeeping, but this cost
may well be repaid in terms of system effectiveness. For single-category
retrieval, of course, no statistics are needed. Raising or lowering the
retrieval cutoff point permits a trade-off of false drops against missing
information. However, there may be room for improvement in the particular
parameters used in the optimization of the goodness of retrieval;
parameters based on ratios rather than on absolute numbers of documents
might be more effective.
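
The cutoff trade-off can be sketched as follows. The probabilities, the set
of documents the user actually wants, and the function names are all
hypothetical; the two ratios are only one possible reading of "parameters
based on ratios" and are not the parameters used in this project.

```python
def retrieve(doc_probabilities, cutoff):
    """Return documents whose probability of correct categorization meets the cutoff."""
    return {doc for doc, p in doc_probabilities.items() if p >= cutoff}


def evaluate(doc_probabilities, relevant, cutoff):
    """Count false drops and misses, and also express each as a ratio."""
    retrieved = retrieve(doc_probabilities, cutoff)
    false_drops = len(retrieved - relevant)   # retrieved but not wanted
    misses = len(relevant - retrieved)        # wanted but not retrieved
    false_drop_ratio = false_drops / len(retrieved) if retrieved else 0.0
    miss_ratio = misses / len(relevant) if relevant else 0.0
    return false_drops, misses, false_drop_ratio, miss_ratio


# Hypothetical data: probability that each document belongs to the category.
probs = {"d1": 0.95, "d2": 0.80, "d3": 0.60, "d4": 0.40, "d5": 0.15}
relevant = {"d1", "d2", "d4"}

for cutoff in (0.9, 0.5, 0.1):
    print(cutoff, evaluate(probs, relevant, cutoff))
```

Lowering the cutoff reduces the number of misses at the price of more false
drops; raising it does the reverse.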

It is apparent that in any attempt to perform content retrieval
rather than document retrieval, query processing lies at the heart of
the problem. The system will need to perform a great deal of inference,
and the ways that this inferential process can be performed are not at
all clear as yet. In addition, severe problems exist with respect to
the semantics of the data and the resolution of ambiguity, although there
are some promising approaches in this area, particularly the application
of sense-value theory. The work on the theory of questioning is still
embryonic; however, some progress has been made in this area by other
investigators.


6.  OVER-ALL  CONCLUSIONS 

The state of the art in information retrieval is characterized by
two different approaches:

(a)  Ad  hoc  methods  for  solving  logically  straightforward  problems 
with  the  greatest  possible  efficiency. 

(b)  Theoretical efforts to resolve the difficult problems associated
with descriptive structures, assigning descriptions to documents,
file structure and memory organization, and query processing.

This project has been oriented toward the second approach. The appropriate
approach is strictly a function of the particular application being dealt
with. For retrieval on personnel files and similar applications, a highly
coordinated approach to develop a complete specialized system is sufficient.
The primary question then is one of application. For problems such as
general documentation and intelligence analysis, there does not appear to
be any way to short-cut the truly difficult problems. This study has
highlighted some of these problems and developed a few tentative steps
towards solving them.

The frame of reference for the research performed during the course
of this project was a general system model in which two processes occur
simultaneously and independently: entering documents or information
about documents into the system, and responding to queries related to
specific requirements for information. Although four general research
tasks were isolated and analyzed, the content of these tasks was
interrelated. Thus the descriptive structure of retrieval systems and
the assignment of descriptors are interdependent, and both are
intrinsically related to the ultimate problem of query processing. These
factors also impinge upon the correlated functions of storage and
retrieval. In storage devices or memories neither size nor speed is
the important problem; rather, it is a question of organization, the
structure of information as it pertains to the essential requirements
of serving a user's demand for information.

This report has emphasized possible techniques for automating all
storage and retrieval processes. A tacit assumption underlying this
emphasis is that the information systems in question are large. Manual
techniques are still suitable for relatively small collections of
information. But, granting the assumption of magnitude, it is essential
to develop techniques for the analysis of information by machines,
primarily because human beings are notoriously inconsistent and prone
to error. Only in large systems do these human tendencies lead to
inefficiency and ineffectiveness.

At this stage of the research process, knowledge about the nature
of the total problem is insufficient. For this reason the conclusions
about the research performed are tentative. Each area could be studied
further with more definitive results; alternatively, techniques that
are potentially more beneficial could evolve. Any future research would
also benefit from a test bed of data that could be used empirically to
test theoretical concepts.

One fact is clear: it is still premature to develop special-purpose
equipment for information storage and retrieval. Such a step should be
deferred until the requisite research and empirical verification have
produced reasonably complete knowledge about the problem and a
comprehensive description of the requirements.

7.  RECOMMENDATIONS


The concept of information retrieval has degenerated from a
rigorously defined problem to a general catch-all for a variety of
problems. The range of the popular description includes both the
difficult and the mundane. This study has attempted to limit the
definition and the scope of information retrieval to the difficult problems
related either to scientific and technical documentation or to intelligence
analysis.

Both documentation and intelligence analysis systems are characterized
by a particular attribute: their content and nature cannot be defined
a priori. Both are dependent upon their information content for their
descriptions. Unless these descriptions are satisfactorily specified
(and no existing method permits adequate specification), the retrieval
systems will be virtually useless.

The first recommendation may, therefore, be startling. If the
contemplated system is definable a priori and if the information content is
well structured, no further research is required to describe a suitable
retrieval system. Personnel files are the ubiquitous example. The
appropriate subject in this case is not research but either systems or
applications analysis. If the objective is to develop equipment, then
the nature of the information system must be described, and operational
characteristics must be specified for speed, accuracy, efficiency, and
effectiveness.

The second recommendation has evolved from the difficulty of adhering
to a pure definition of information retrieval. This recommendation also
follows from the current state of knowledge about the subject. The
subject of information retrieval has become too broad, while specific
problems confronted in information retrieval have been either roughly
or specifically defined during the course of several research programs,
including this one sponsored by USAEL. Further research in information
retrieval per se would result in an indefinitely structured project.
Funds would be more fruitfully expended on research projects related to
specific problem areas encompassed by information retrieval.


The need for special studies, defined and specified as such, is
urgent. The research conducted during this project, for example,
constitutes only a beginning. This recommendation, therefore, is presented
as a necessary next step in advancing the state of the art and in
enhancing the use of automated techniques, specifically computer-oriented
techniques.


The principal recommendation for future work is that it be directed
more towards specific types of problems. For applications where the
problems of developing a descriptive structure and assigning descriptions
to documents are trivial, it is advisable to develop an ad hoc system
that is highly coordinated internally and specialized for a particular
problem. Such systems need not be completely specialized, because a
system that is appropriate for personnel records may also be appropriate
for parts listings or for literature with an existing fixed set of
categories and manual categorization. However, it is inadvisable to
try to attack problems such as intelligence analysis with a similar
system.


The importance of the more difficult problems is sufficiently great
that a long-term and continuing research program is thoroughly
warranted. This program would require the extension of some of the
ideas developed in this project within a more rigorous theoretical
framework. The studies should consider the following problems as well
as others:

(a)  Descriptive Structure - The work performed during this project
has only begun to attack this problem. It is necessary to
develop a formal, perhaps mathematical, theory of the structure
of knowledge and to base the descriptive scheme on this
structure. The development of a formal theory has been
attempted, but as yet the efforts have been inadequate to
the task. A solid theory of descriptive structure is the
essential underpinning of any content retrieval system; until
this theory has been completed, all other conclusions are at
best tentative.

(b)  Linguistic Analysis - It is recommended that existing work in
mechanical translation of languages be applied to the
transformation of natural languages to formal languages suitable for
deductive reasoning. Many of the problems of natural language
translation can be sidestepped in this effort, since the
translational defects will not seriously impair the effectiveness
of a retrieval system. For instance, the problem of translating
a word with several alternative meanings can be considerably
simplified, since for most purposes the mere identity of words
will be sufficient for the kinds of deductions to be performed.
It should be emphasized that this recommendation is for the
application of existing work in a different area rather than
for totally new investigations.

(c)  Methods of Inference - Given a large body of formal statements,
methods are needed for obtaining the desired logical consequences
of these statements. The problem resembles, but is not identical
to, the problem of developing formal proof procedures for
symbolic logic. The major difference is that relatively
immediate inferences are to be drawn from a large base of
information rather than quite deep inferences from a small
base of information. The solution of this problem is also
essential for an effective content retrieval system.

(d)  Development of Query Languages - The particular mode of
communication between the user and the retrieval system must be
studied in detail. It is recommended that work be
performed in this area, but not until the other areas have
been more thoroughly developed.

8.  IDENTIFICATION OF PERSONNEL

8.1  PERSONNEL ASSIGNMENTS

The following personnel were assigned to this project during the
course of the contract:

Jacques Harlow               Principal Investigator
Paul W. Abrahams, Sc.D.*     Research Specialist
George Greenberg, Ph.D.      Research Specialist
Quentin A. Darmstadt         Research Specialist
Alexander Szejman            Senior Specialist
Alfred Trachtenberg          Senior Program Analyst
Maralyn W. Lindenlaub        Senior Program Analyst

The asterisk (*) indicates those personnel who contributed to the project
during the final quarter. Both Drs. Greenberg and Abrahams acted as
associate investigators at different times in the course of this research
program; in particular, Dr. Abrahams filled this role during the last
two quarters and contributed significantly to the integration of the
several research tasks.

The approximate number of man-hours by title expended during the
total contractual period was:

Management and Supervision     700
Research Specialist           1500
Senior Specialist             4500
Senior Program Analyst        4£oo
Clerical                       300

The titles listed with the names above reflect each person's position
during the last quarter of his participation in the project. Therefore,
the distribution of man-hours differs from the distribution of present
titles.

8.2  BACKGROUND  OF  PERSONNEL 

The background of each person assigned to this project was summarized
in the quarterly reports.

9.  APPENDICES


9.1  APPENDIX A - Maxima and Minima of the Measures

In this appendix the behavior of the measures of goodness and the
various entropy functions will be examined. Maxima and minima in terms
of the $p_j$ and $p_{ij}$ are summarized in Tables 3 and 4.

For these tables it is assumed that $A$ is chosen such that $A = 1/p_e$,
where $p_e$ is the smallest $p_j$; that is, $p_e \le p_j$ for all $j$. For the
functions of Table 3 ($H$, $H_A$, and $S_i$) the pertinent values are the
maximum and minimum values in terms of a given $p_e$ and the absolute
maximum and minimum values of each function.

For $H$ and $H_i$, maxima are reached when the probabilities are equal
or, for a particular $p_e$, when the other $p_j$ are equal; minima are reached
when one probability becomes a maximum and the rest are minima.

While $H_A$ does not reach an absolute maximum when $H$ does, since it
was assumed that $A = 1/p_e$, it does reach a maximum together with $H$ for a
particular $p_e$. Then:

$$H_A = -\sum_j p_j \log p_j + \log A = -\sum_j p_j \log p_j - \log p_e
      = -\sum_{j \ne e} p_j \log p_j - (1 + p_e) \log p_e \tag{A-1}$$

Therefore, $H_A$ becomes a maximum for a particular $p_e$ when
$p_j = (1 - p_e)/(k - 1)$ for $j \ne e$. Then:

$$H_{A\max} = -(1 - p_e) \log\frac{1 - p_e}{k - 1} - (1 + p_e) \log p_e \tag{A-2}$$

TABLE 3.  Maxima and Minima of Entropy Functions

The largest $H_{A\max}$ occurs when $p_e = 1/N$. Then:

$$H_{A\,\mathrm{absmax}} = -\left(1 - \frac{1}{N}\right) \log\frac{1 - 1/N}{k - 1} + \left(1 + \frac{1}{N}\right) \log N \tag{A-3}$$

$H_A$ becomes a minimum for a particular $p_e$ when $H$ does; that is, when the
maximum $p_j$ is $p_t = 1 - (k - 1) p_e$ and $p_j = p_e$ for $j \ne t$, where
$p_e \le p_j$ for all $j$. Then:

$$H_{A\min} = -[1 - (k - 1) p_e] \log[1 - (k - 1) p_e] - [1 + (k - 1) p_e] \log p_e \tag{A-4}$$

The smallest $H_{A\min}$ occurs when $p_e = 1/k$. Then:

$$H_{A\,\mathrm{absmin}} = 2 \log k \tag{A-5}$$

$S_i$ becomes a maximum when $p_{ij} = p_j$ for all $j$. This maximum can be
derived by using Gibbs' theorem, as in Watanabe [84]:

$$S_{i\max} = \log A = -\log p_e \tag{A-6}$$

The largest $S_{i\max}$ occurs when $p_e = 1/N$:

$$S_{i\,\mathrm{absmax}} = \log N \tag{A-7}$$

$S_i$ becomes a minimum when $p_{ij}$ becomes one for the particular $j$ for
which $p_j$ is smallest. Then:

$$S_{i\min} = -\log\frac{1}{A p_e} \tag{A-8}$$

But:

$$A = 1/p_e \tag{A-9}$$

So:

$$S_{i\min} = 0 \tag{A-10}$$
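
The closed forms (A-2), (A-4), and (A-5) can be checked numerically. The
sketch below is only a sanity check under the stated assumption
$A = 1/p_e$; the function names and the sample values of $k$ and $p_e$ are
hypothetical, and natural logarithms are used for convenience.

```python
import math


def entropy(p):
    """A priori entropy H = -sum_j p_j log p_j (zero terms are skipped)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def h_a(p):
    """H_A = H + log A with A = 1 / p_e, p_e being the smallest probability."""
    return entropy(p) - math.log(min(p))


def h_a_max(p_e, k):
    """Equation (A-2): the other k - 1 probabilities are equal."""
    return (-(1 - p_e) * math.log((1 - p_e) / (k - 1))
            - (1 + p_e) * math.log(p_e))


def h_a_min(p_e, k):
    """Equation (A-4): k - 1 probabilities equal p_e, one takes the remainder."""
    p_t = 1 - (k - 1) * p_e
    return -p_t * math.log(p_t) - (1 + (k - 1) * p_e) * math.log(p_e)


k, p_e = 4, 0.1
print(math.isclose(h_a([0.1, 0.3, 0.3, 0.3]), h_a_max(p_e, k)))
print(math.isclose(h_a([0.1, 0.1, 0.1, 0.7]), h_a_min(p_e, k)))
print(math.isclose(h_a_min(1 / k, k), 2 * math.log(k)))
```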

For the functions of Table 4 ($M_1$ through $M_4$) there are three kinds of
maximum and minimum values: the maxima and minima for a given $p_j$
distribution; the maxima and minima when only $p_e$ is given; and the absolute
maxima and minima. To keep the notation consistent with that of Table 3,
these maxima and minima will be indicated as follows:
$M_{1\max j}$, $M_{2\max j}$, etc., are the maxima for a given distribution.
Similarly, $M_{1\min j}$, $M_{2\min j}$, etc., are the minima for a given
distribution. $M_{1\max}$, $M_{2\max}$, $M_{1\min}$, $M_{2\min}$, etc., are the
maxima and minima when only $p_e$ is given, and $M_{1\,\mathrm{absmax}}$,
$M_{1\,\mathrm{absmin}}$, $M_{2\,\mathrm{absmin}}$, etc., are the absolute
maxima and minima.


$M_1 = H - H_i$ is maximized for a particular distribution when $H_i$
is a minimum ($H_i = 0$). Then $M_{1\max j}$ is simply the a priori entropy $H$.
$M_{1\max}$, which is maximized for a particular $p_e$, is simply the a priori
entropy maximized; $M_{1\,\mathrm{absmax}}$ is the absolute maximum of the a
priori entropy.

Similarly, the minima of $M_1$ are obtained when $H_i$ is set equal to
$H_{i\max}$ ($H_{i\max} = \log k$) and by minimizing the a priori entropy.

$M_2 = H - S_i$ is maximized when $S_i$ is a minimum ($S_{i\min} = 0$); the maxima
are simply the maxima of the a priori entropy. $M_2$ is minimized when
$S_i = S_{i\max} = -\log p_e$; $M_{2\min} = H_{\min} - S_{i\max}$ when $H = H_{\min}$.
In addition, $M_{2\,\mathrm{absmin}}$ occurs when $H = H_{\mathrm{absmin}}$.

$M_3 = H_A - S_i$ is maximized when $S_i = S_{i\min}$; the maxima are $H_A$,
$H_{A\max}$, and $H_{A\,\mathrm{absmax}}$, respectively. The minima of $M_3$ are
not as obvious, for the conditions of maximizing $S_i$ and minimizing $H_A$
can be contradictory. It is best to analyze the minima of $M_3$ as follows:


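$$M_3 = H_A - S_i = -\sum_j p_j \log p_j + \log A + \sum_j p_{ij} \log\frac{p_{ij}}{A p_j}
     = -\sum_j p_j \log p_j + \sum_j p_{ij} \log\frac{p_{ij}}{p_j} \tag{A-11}$$

For a particular distribution, $M_{3\min j}$ occurs when $p_{ij} = p_j$ for
all $j$. Therefore:

$$M_{3\min j} = -\sum_j p_j \log p_j = H \tag{A-12}$$

Then for a particular $p_e$:

$$M_{3\min} = H_{\min} \tag{A-13}$$

and the absolute minimum is simply:

$$M_{3\,\mathrm{absmin}} = H_{\mathrm{absmin}} \tag{A-14}$$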


$M_4$ is the simplest measure of them all, reaching a maximum when $S_i$
is a minimum, and a minimum when $S_i$ is a maximum:

$$M_4 = \log A - S_i = \sum_j p_{ij} \log\frac{p_{ij}}{p_j} \tag{A-15}$$


That this measure is always greater than or equal to zero can be shown
by applying Gibbs' theorem:

$$M_4 = \sum_j p_{ij} \log p_{ij} - \sum_j p_{ij} \log p_j \tag{A-16}$$

But:

$$\sum_j p_{ij} \log p_{ij} - \sum_j p_{ij} \log p_j \ge 0 \quad \text{(Gibbs' theorem)} \tag{A-17}$$

Therefore:

$$M_4 \ge 0 \tag{A-18}$$

The maximum of $M_4$ is:

$$M_{4\max j} = M_{4\max} = \log A \tag{A-19}$$

The absolute maximum occurs when $p_e = 1/N$; then $A = N$ and:

$$M_{4\,\mathrm{absmax}} = \log N \tag{A-20}$$
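
As a minimal sketch of how $M_4$ might be used to rank candidate clue
words, the code below evaluates equation (A-15) for a few invented
distributions. The category probabilities, the words, and the function name
are hypothetical and serve only to illustrate the bounds
$0 \le M_4 \le \log A$.

```python
import math


def m4(p_word, p_prior):
    """M_4 = sum_j p_ij * log(p_ij / p_j), equation (A-15).
    Terms with p_ij = 0 contribute nothing."""
    return sum(q * math.log(q / p) for q, p in zip(p_word, p_prior) if q > 0)


prior = [0.5, 0.3, 0.2]            # hypothetical a priori probabilities p_j
clue_words = {                     # hypothetical a posteriori distributions p_ij
    "radar":   [0.05, 0.05, 0.90],
    "circuit": [0.40, 0.40, 0.20],
    "the":     [0.50, 0.30, 0.20],
}

scores = {word: m4(dist, prior) for word, dist in clue_words.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
log_a = math.log(1 / min(prior))   # upper bound log A with A = 1 / p_e

for word in ranking:
    print(f"{word:8s} M4 = {scores[word]:.4f}  (bound log A = {log_a:.4f})")
```

A word whose a posteriori distribution equals the a priori distribution
scores zero, and a word that always indicates the least probable category
reaches the bound $\log A$.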

9.2  APPENDIX B - Derivation of the Predictor Effectiveness Measure $M_4$
From Some Fundamental Definitions of Information

The information, $I$, supplied by an event is usually defined as the
difference between the a priori and a posteriori entropies. In this
case:

$$I = H - H_i \tag{B-1}$$

Where:

$$H = -\sum_j p_j \log p_j \tag{B-2}$$

And:

$$H_i = -\sum_j p_{ij} \log p_{ij} \tag{B-3}$$

To overcome the difficulty of having a negative information quantity at
times, which does not concur with our intuitive notions of information,
Watanabe [84] suggests that relative entropy functions should be used
instead of the usual entropy functions $H$ and $H_i$. The relative entropy,
$S$, in general is:

$$S = -\sum_j \pi_j \log\frac{\pi_j}{B q_j} \tag{B-4}$$

Where:

$\pi_j$ = the probability distribution under study
$q_j$ = the a priori, or reference, probability distribution
$B$ = a positive constant.

Then, by using the standard definition of information, the difference
between the two entropies, except for substituting relative entropies
this time, we obtain:

$$I_r = S(p_j) - S(p_{ij}) \tag{B-5}$$

To evaluate $S(p_j)$, let:

$$\pi_j = p_j, \quad q_j = p_j, \quad B = A \tag{B-6}$$

To evaluate $S(p_{ij})$, $q_j$ remains equal to $p_j$ and $B = A$, but:

$$\pi_j = p_{ij} \tag{B-7}$$

Then:

$$S(p_j) = \log A \tag{B-8}$$

And:

$$S(p_{ij}) = -\sum_j p_{ij} \log\frac{p_{ij}}{A p_j} = S_i \tag{B-9}$$

Then:

$$I_r = \log A - S_i = \sum_j p_{ij} \log\frac{p_{ij}}{p_j} \tag{B-10}$$

Therefore:

$$I_r = M_4 \tag{B-11}$$

and $M_4$ then measures the amount of information supplied by the occurrence
of word $i$.

$M_4$ can also be derived by using the definition of information used by
Goldman [23]: the log of the ratio of the a posteriori to the a priori
probability. Symbolically, for this case:

$$I' = \log\frac{p_{ij}}{p_j} \tag{B-12}$$

If this quantity is averaged over all $i$ and $j$, then the usual information
quantity results. Here, this quantity should be averaged over $j$ only,
and this averaging must be done for a particular $i$. The quantity desired
is:

$$\langle I' \rangle_{j|i} = \left\langle \log\frac{p_{ij}}{p_j} \right\rangle_{j|i} \tag{B-13}$$

It is necessary, then, to use the conditional probability distribution
$p_{ij}$ to obtain the correct average. Then:

$$\langle I' \rangle_{j|i} = \sum_j p_{ij} \log\frac{p_{ij}}{p_j} \tag{B-14}$$

And:

$$\langle I' \rangle_{j|i} = M_4 \tag{B-15}$$
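
The derivation can be checked numerically on invented data. In the sketch
below the distributions and function names are hypothetical; it shows a case
in which the ordinary information $I = H - H_i$ of (B-1) is negative while
the relative-entropy form $I_r$ of (B-5) is non-negative and agrees with
$M_4$.

```python
import math


def entropy(p):
    """Ordinary entropy -sum_j p_j log p_j, as in (B-2) and (B-3)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def relative_entropy(pi, q, b):
    """S = -sum_j pi_j * log(pi_j / (b * q_j)), equation (B-4)."""
    return -sum(x * math.log(x / (b * y)) for x, y in zip(pi, q) if x > 0)


prior = [0.8, 0.1, 0.1]            # hypothetical p_j
given_word = [0.34, 0.33, 0.33]    # hypothetical p_ij for one word i
a = 1 / min(prior)                 # A = 1 / p_e

plain_information = entropy(prior) - entropy(given_word)   # (B-1)
s_prior = relative_entropy(prior, prior, a)                 # (B-8)
s_i = relative_entropy(given_word, prior, a)                # (B-9)
i_r = s_prior - s_i                                         # (B-5)
m4 = sum(x * math.log(x / y)
         for x, y in zip(given_word, prior) if x > 0)       # (A-15)

print(f"I = {plain_information:.4f}, I_r = {i_r:.4f}, M4 = {m4:.4f}")
```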


9.3  APPENDIX C - Existing Methods of Document Description

9.3.1  Indexing and Automation - A fundamental aspect of today's
indexing schemes is their ultimate adaptability to automated procedures.
These procedures have been used to produce many different types of
indexes, including author, citation, report number, conventional
subject-heading, and coordinate indexes. Coordinate indexing, which may be
considered as one of the first steps beyond the traditional manual indexing
systems, consists of the description of information contained in
documents by the use of unit-concepts. These unit-concepts are called by
many names: Uniterms (Taube), keywords (Luhn), and descriptors (Mooers).
Unit-concepts can be characterized by the controls placed upon them. For
example, if we extract words directly from documents and use these words
without further controls of any kind (such unit-concepts have been called
Uniterms), we have the basis of a permuted or KWIC indexing scheme. We
shall review this indexing method in some detail and analyze some of the
effects that such a control-free word system appears to be having on
indexers and authors alike.

The use of a Uniterm system can inflict a large number of synonyms
upon a user. For example, if we use Roget's Thesaurus as an authority,
the word "hardness" has such synonyms as: rigidity, firmness, stiffness,
inflexibility, temper, toughness, etc. Such a system of Uniterms needs
cross-referencing from one word to synonyms or related words. The
Chemical Engineering Thesaurus and the ASTIA Thesaurus of Descriptors
(2nd edition) are examples of such referencing. Such a free vocabulary
may be transformed into a formal descriptor language that will be
synonym free, since explicit definitions or scope notes will exist for
each descriptor. If the number of descriptors to be used is not fixed,
then at least the rate of growth should be subject to careful regulation.
Since only a limited number of descriptors can efficiently be assigned
to a text, Jacobson [37] has assumed that only a limited amount of text
can be efficiently indexed. He further suggests the need to divide the
text of documents into distinct portions and to subject each portion to
certain indexing regulations.
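
The cross-referencing step can be sketched as a simple table lookup. The
table below is an invented fragment based on the "hardness" example, not an
excerpt from any of the thesauri named above, and the function name and the
optional limit on the number of descriptors are assumptions made for
illustration.

```python
# Hypothetical cross-reference table: free-vocabulary word -> descriptor.
SYNONYMS = {
    "rigidity": "hardness",
    "firmness": "hardness",
    "stiffness": "hardness",
    "inflexibility": "hardness",
    "toughness": "hardness",
}


def to_descriptors(uniterms, max_descriptors=None):
    """Map free words to descriptors, dropping duplicates and keeping
    first-seen order; optionally cap the number of descriptors assigned."""
    assigned = []
    for word in uniterms:
        descriptor = SYNONYMS.get(word.lower(), word.lower())
        if descriptor not in assigned:
            assigned.append(descriptor)
    return assigned if max_descriptors is None else assigned[:max_descriptors]


print(to_descriptors(["Stiffness", "steel", "rigidity", "alloy"]))
```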

As more descriptors are assigned to a document in an effort to
anticipate novel requests for information, the possibility of increasing
the noise, or non-relevant information, is increased. Several devices
have been incorporated into descriptor schemes to reduce this noise.
Maron and Kuhns [48] suggest that each descriptor may be weighted
according to its relevance for the particular document involved. Hllf [34]
reports a practical approach to weighting by the use of an asterisk to
indicate those descriptors of major interest.

9.3.2  Facet Analysis and Role Indicators - One technique for
organizing the proliferation of descriptors is known as facet analysis.
The entire set of descriptors is grouped into facets. The descriptors
within a facet can be viewed as the possible answers to a question
concerning the contents of a document to be classified. Thus a facet
represents the question itself; ideally, facets should be chosen so that
their corresponding questions exhaust the information on how to classify
the document and, at the same time, so that there is a minimal overlap
(ideally, none) of the informational content in the answers to the
questions. If, for a particular document, the question represented by
a facet is meaningless, no descriptor from this facet will be assigned
to the document.

In terms of the Multi-List system discussed in Section U.ln2,
attributes may be viewed as facets and values of attributes, as
descriptors within facets. If attributes are set up by human beings, they may
correspond to natural questions; but if they are set up mechanically,
they may correspond to quite complicated and artificial questions.

A discussion of facet analysis appears in Vickery [81]. Vickery
specifies the product of a facet analysis to be a set of schedules in
which terms are first grouped into well-defined facets and then, within
each facet, arranged in a close order. The classifier using these
schedules is aided because the structure of each subject is displayed. The
selection of facets is dictated by the user's requirements. As an example,
a survey of 1,000 research physicists identified some of the following
performance characteristics of a reference retrieval system: it should
specify type of research (whether experimental or theoretical); it
should specify aspect of research (property, object, method) [2]. An
example of a working system in an engineering field consists of a
descriptor vocabulary of 600 words within a framework of nine facets [34].
Hayes [30] has also pointed out the advantages of facet analysis from the
automation point of view. Slamecka [72], however, feels it is conjectural
whether facet analysis helps to improve the quality of indexing.

Closely related to facet analysis is another method known as role
indicators. When this method is used, each descriptor has appended to
it a suffix that says what sort of descriptor it is or, in terms of
facet analysis, what facet it belongs to. These suffixes are known
as role indicators. For example, in the Western Reserve University
system, which utilizes twenty-four role indicators, the suffix KAM indicates
a descriptor referring to a process and the suffix KIT, a descriptor of
time or place. Costello and Wall use eleven role indicators, Farradane
[19] has proposed the use of nine, and the Engineering Joint Council [59]
recommends the use of ten.

It is difficult to ascertain the relative effectiveness of the various
descriptor organizations used in indexing. The Cranfield Project [69]
was designed as an investigation into the relative retrieval efficiency
of four forms of indexing: universal decimal classification, a
subject-heading system, a faceted classification, and the Uniterm system.
The results of this project are now available, but must be interpreted only
in the light of a thorough knowledge of the project.

9.3.3 KWIC Indexing - The procedure commonly known as permuted
indexing or KWIC indexing (that is, Key-Word-In-Context indexing) is the
most sophisticated of today's operational automated indexing schemes.
Yet it is not without its critics, and certainly not without inherent
limitations. We shall briefly review the nature of this system as well
as some present thoughts on making such indexing more effective.

KWIC indexing may be carried out on various levels; the process
may be applied to the title, the abstract, portions of the text, or,
indeed, the entire text. Thus far the method has had its greatest reported
use in connection with titles. KWIC indexing uses the content words in
the title of an article as index terms. A list of non-significant words
is prepared for use in processing a KWIC index. This list would include
words such as "an", "of", "in", "the", "at", "are", etc. Each word within
the title that is not on the non-significant word list is cyclically
permuted in such a way that the word is aligned on a particular column
so that alphabetical sequence is observable. For example, consider the
title:

    "An Evaluation of KWIC Indexing Methods in Chemistry."

This title would be arranged as follows in a KWIC index:

    INDEXING METHODS IN  CHEMISTRY. AN EVALUATION OF KWIC
       IN CHEMISTRY. AN  EVALUATION OF KWIC INDEXING METHODS
     EVALUATION OF KWIC  INDEXING METHODS IN CHEMISTRY. AN
       AN EVALUATION OF  KWIC INDEXING METHODS IN CHEMISTRY.
       OF KWIC INDEXING  METHODS IN CHEMISTRY. AN EVALUATION
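
The permutation just described can be sketched in a few lines of Python. This is only an illustration, not the program used by any of the systems cited: the function names, the stop list, and the choice of five words to the right of the index column (`right_words`, picked to match the layout of the example above) are all assumptions.

```python
# Non-significant words; the report lists these six as examples.
STOP_WORDS = {"an", "of", "in", "the", "at", "are"}

def kwic_entries(title, stop_words=STOP_WORDS, right_words=5):
    """Return (left context, keyword + right context) pairs sorted by keyword."""
    words = title.upper().split()
    entries = []
    for i, word in enumerate(words):
        if word.strip('.,;:"').lower() in stop_words:
            continue  # non-significant words never become index terms
        # Cyclic permutation: rotate the title so the keyword comes first.
        rotated = words[i:] + words[:i]
        right = " ".join(rotated[:right_words])
        left = " ".join(rotated[right_words:])
        entries.append((left, right))
    # Alphabetical order of the keyword column.
    return sorted(entries, key=lambda entry: entry[1])

def format_kwic(entries):
    """Right-align the left context so every keyword starts in one column."""
    width = max(len(left) for left, _ in entries)
    return "\n".join(f"{left:>{width}}  {right}" for left, right in entries)

if __name__ == "__main__":
    title = "An Evaluation of KWIC Indexing Methods in Chemistry."
    print(format_kwic(kwic_entries(title)))
```

Run on the sample title, the sketch yields the five lines of the arrangement shown above. A production index would also truncate long titles to a fixed line width, which this sketch does not attempt.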


The first use of KWIC indexing was reported at the International Conference
on Scientific Information in Washington, D. C., in 1959 [7]. Since that
time the KWIC technique has been used to index the literatures of chemistry,
biology, aerospace, and a score of other fields.


9.3.3.1 The Descriptive Power of Titles - The KWIC indexing
procedure is based upon the assumption that the title of an article is
descriptive of the information content of the article and significantly
related to it. Some of the reported problems with the system have been
based on the simple fact that most of these indexes have used a
single-line format that does not effectively handle the longer
titles. More significant, however, are those problems that seem to attack
the fundamental assumption of this indexing method. The problem is
described in various ways. Meyer-Uhlenried [49] states that an analysis
of different KWIC indexes has shown that titles are often not significant
enough for the publication; and Penny, et al. [57] have said that the
literature must be examined thoroughly in order to determine content
because the content is not always obvious from the abstract. Newbaker
[51], on the other hand, claims that titles contain sufficient indexing
information for most retrieval applications.

Data are occasionally presented to substantiate a position on
the matter. For example, Slamecka and Zunde [73] report that, when
evaluated for use in permuted and KWIC indexes, between $0 and 90 percent
of author-prepared document titles (depending on subject field and
other factors) were found fully to reflect the subject terms to which
their documents were assigned by human indexers. In a preliminary
examination of various legal information problems by the American Bar
Foundation [13] an experiment was conducted in which KWIC indexing of
titles was compared with indexing by the subject-heading classification
system. The results showed that [illegible] percent of the title entries
contained as keywords one or more of the subject-heading words
under which they had been indexed and 15.1 percent contained logical
equivalents. In a report by White [85] of experiments on methods of
indexing the 196.' issues of the Abstracts of Computer Literature, the
permuted-title indexing retrieved only '>2 percent of the information.

Data from comparative tests of this kind will vary depending on
such items as test criteria and definitions, indexing systems being
compared, and subject field being indexed. Bornstein [10] states that the
conflict in Swanson's [76] results can be traced to the different
experimental methods used and the definition of the criteria of success. For
example, we recently compared the descriptors (part of a faceted
classification scheme) used to index 162 papers in the field of scientific
communication [60] and the terms in a KWIC index of the same papers.
Only 13 percent of the papers had titles that reflected fully the
descriptors used to index the same documents.
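
A comparison of this kind reduces to a simple proportion, and the sketch below shows one way it might be computed. It is an assumption-laden illustration: the matching criterion (every descriptor must appear verbatim among the title's content words), the stop list, and the two sample records are invented here and are not the criteria or data of any study cited above. In particular, no credit is given for logical equivalents.

```python
STOP_WORDS = frozenset({"a", "an", "of", "in", "the", "at", "are", "for"})

def title_terms(title):
    """Content words of a title, lower-cased, with the stop list removed."""
    words = {w.strip('.,;:"').lower() for w in title.split()}
    return words - STOP_WORDS

def fully_reflected(title, descriptors):
    """True when every assigned descriptor appears verbatim in the title."""
    return {d.lower() for d in descriptors} <= title_terms(title)

def coverage(papers):
    """Fraction of (title, descriptors) records whose title is fully reflective."""
    hits = sum(fully_reflected(title, descriptors) for title, descriptors in papers)
    return hits / len(papers)

# Hypothetical records, invented for illustration only.
SAMPLE = [
    ("Indexing Methods in Chemistry", ["indexing", "chemistry"]),
    ("A Survey of Current Practice", ["retrieval", "thesaurus"]),
]

if __name__ == "__main__":
    print(f"{coverage(SAMPLE):.0%} of titles fully reflect their descriptors")
```

Changing the criterion inside `fully_reflected` changes the reported percentage, which is one concrete way in which differing definitions of success produce conflicting figures.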

9.3.3.2 Querying Problems Using KWIC - A quite different
difficulty arising with KWIC indexing lies in the fact that querying is
done manually by scanning an output list. Once the output exceeds a
size such that it can be scanned by a human being in a reasonable time,
its value decreases significantly. One reason for this change in value
is the problem of synonymy. As long as the output is manageably small,
a user of a KWIC index can simply read through the entire index and note
the associated documents whenever he encounters a synonym of the
descriptor that concerns him. He need not think of the synonyms beforehand,
since he will recognize them when he sees them. Once the output becomes
too bulky to be scanned in its entirety, the user must resort to a
thesaurus of synonyms. Even with such a thesaurus the large number of
synonyms may make retrieval extremely awkward.


A further difficulty arises when the desired documents belong
to the intersection of two or more descriptive categories. Each of the
categories may be quite large; yet their intersection may be small.
The user must scan each of the categories in full to find that small
set of documents lying in the intersection.
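
For contrast with manual scanning, the following sketch shows how a machine-held descriptor file could answer the same two needs: synonym expansion through a thesaurus and intersection of categories. It is not part of the KWIC procedure described above; the function names, the tiny document file, and the one-entry thesaurus are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each descriptor to the set of document numbers carrying it."""
    index = defaultdict(set)
    for doc_id, descriptors in docs.items():
        for descriptor in descriptors:
            index[descriptor].add(doc_id)
    return index

def lookup(index, term, thesaurus):
    """Union of the documents filed under a term or any listed synonym."""
    found = set()
    for t in {term} | thesaurus.get(term, set()):
        found |= index.get(t, set())
    return found

def query(index, terms, thesaurus=None):
    """Documents lying in the intersection of all requested categories."""
    thesaurus = thesaurus or {}
    categories = [lookup(index, t, thesaurus) for t in terms]
    return set.intersection(*categories) if categories else set()

# Hypothetical file and thesaurus, invented for illustration only.
DOCS = {
    1: {"indexing", "chemistry"},
    2: {"indexing", "biology"},
    3: {"cataloguing", "chemistry"},
}
THESAURUS = {"indexing": {"cataloguing"}}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(query(index, ["indexing", "chemistry"], THESAURUS))
```

The intersection is computed on sets of document numbers, so the user never has to read either category in full; the cost of synonymy shifts from the reader to whoever maintains the thesaurus.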

9.3.3.3 Improving the KWIC System - KWIC is now an operational
automated indexing system. The problems that have been noted seem real,
but  solutions  to  these  problems  are  being  advanced  and  some  are  themselves 
becoming  operational.  The  solutions  that  we  shall  enumerate  run  the  gamut 
of  possible  controls  and  procedures  that  would  affect  an  indexer,  an 
author,  and  a  user. 

At the Scientific and Technical Information Facility titles of
documents are expanded and elaborated into a notation of content for
publication in STAR. This notation of content can be considered either
as an expanded title or as a highly condensed abstract. This technique
might be considered the first step, from the indexing point of view,
towards improving the effectiveness of titles for deriving indexing
terms. It is a fundamental assumption of an indexing system proposed
by the Engineering Joint Council [1, 59] that the author of a technical
article can be the most instrumental in the one-time indexing of his
article. As an ideal situation this indexing would satisfy all future
indexing requirements for that article.

Connolly [12] discusses his experience with key-terms or
content analysis appearing in Applied Physics Letters. The terms were
originally assigned in the editorial office of the Jet Propulsion
Laboratory, but they are now generally provided by the author's filling
out a form that is sent to him when his paper is received. A
combination of the terms drawn from the author's completed forms and from the
title of the articles might well overcome one major objection to KWIC
indexing: that titles alone may be inadequate as descriptors of the
content of a paper.

Another point of view for including the author in the indexing
problem is noted by Brandenburg [11], who states that title writing must
balance machine requirements against human scanning habits. Man-machine
requirements may conflict with acceptable title length, significant
words, attention-getting devices, and work forms for retrieval. Similarly,
Kennedy [39] has enumerated nine steps for the construction of good titles
for ultimate KWIC indexing.

9.3.4 Other Problems in Scientific Documentation - Some of the
difficulties in retrieving scientific information lie in the nature of the
documents themselves rather than in the descriptiveness of titles and
index terms. This conclusion is reached by the Weinberg Report [70;
see also 68, No. t»], which laments the failure of scientists and engineers
to express themselves clearly. It is reported [13] that Tufts University
is critically reviewing the literature and past research on the
effectiveness of technical writing as a means of communication. The study is
concentrated on the variables in the writing and graphic processes that
have some measurable communication effect upon the reader.

Of interest to the author of scientific communication are the many
comments [25, 82] that suggest that part of today's problem of
information retrieval stems from the sheer volume of literature and a certain
carelessness with which scientists stuff the literature with their reports.
Waldo [82] refers to a system that replaces report writing, indexing, and
file storage by storing data on magnetic tape and by retrieving as necessary
through appropriate questions to a computer. A similar notion was viewed
by Hamming [26] as information regeneration. He gives the example:
rather than retrieve the values of trigonometric functions, regenerate
them as needed. Dubinin [15] of the USSR also suggests the use of
computing machines for storing information and for retrieving available
information only upon demand. And, perhaps as a final extreme, Shiloh
[71] has suggested that the burden of reading should be lightened by
using other techniques of communication, in particular the use of
intswsc-Aloxval seminars.

9.3.5 Summary - A great deal of existing literature has been
examined in order to discover existing systems for organizing descriptors.
These systems have included thesauri for treating synonyms, various
unit-concept systems, facet analysis, and role indicators. In addition, the
KWIC system has been investigated. This system is significant chiefly
because it is by far the most popular system in use today, and for many
applications it fulfills the user's needs at a low cost. Nevertheless,
it has significant drawbacks. Titles are often created without
anticipating their use in KWIC indexing, and these titles are not always a good
reflection of the content of the articles to which they are attached,
although this point is still actively disputed. In addition, when a
list of documents is too long to be scanned conveniently by a human
being, difficulties arise both in searching for synonyms of a given
descriptor and in retrieving documents from the intersection of two or
more large categories.

Some  of  the  problems  in  descriptor  organization  and  information 
retrieval  generally  stem  from  the  failure  of  authors  to  express  them¬ 
selves  clearly.  This  difficulty  appears  in  the  form  of  meaningless 
titles  and  in  the  form  of  articles  that  are  difficult  to  index  even 
manually.  Requiring  the  author  to  attach  descriptors  to  his  work  may 
help to solve this problem, but the probability of effective help is low.


9.4 APPENDIX D - Sense Value Theory and Equivocation in Relation
to Inferential Information Systems

This  appendix  is  an  illustrative  exposition  of  sense  value  theory. 

Its  primary  intent  is  to  clarify  the  applicability  of  sense  value  theory 
to  the  problem  of  equivocation  and  to  outline  necessary  further  research 
and  development  on  sense  value  theory  in  order  to  render  it  applicable 
to  problems  in  inferential  information  processing.  A  formal  exposition 
of sense value theory is contained in Sommers [74] and Darmstadt [14].

In order to appreciate the relevance of sense value it is necessary
to understand the level of language to which sense value theory is
addressed. For the sake of this discussion, five levels of language
may be discriminated:

(a)  Morphology,  orthography,  or  spelling. 

(b)  Syntax or grammar.

(c)  Sense. 

(d)  Logic,  consistency,  or  inference. 

(e)  Fact,  truth,  or  reference. 

This  description  of  language  levels  is  suggestive  rather  than  precise. 

In  general,  information  systems  are  ultimately  concerned  with  language 
at level five; that is, someone needs to know the facts in a given
field  of  knowledge.  But  an  automated  information  system  cannot,  at 
present,  conceivably  perform  any  empirical  tests  on  the  truth  of  its 
assertions.  Such  verification  is  still  best  left  to  human  performance. 

At present we are most interested in developing processing capabilities
at the fourth level of language, and sense value theory is primarily
addressed to the third level of language. It is important to note,
however, that valid conclusions at higher levels of language depend upon
the organization of assertions at lower levels of language.

The last conclusion, as well as the language level classification,
is perhaps best understood in terms of a specific example. Consider the
assertion, "John Smith is the Prime Minister of England." At the factual
level we are interested in the truth of this assertion. If, however, we
amend the assertion to read "...and so is John Jones," then we can
conclude from considerations at the fourth (logical) level that the
statement* need not be evaluated at the fifth (factual) level.

A  statement  becomes  inappropriate  for  evaluation  at  the  fourth  level 
if  a  failure  or  error  occurs  at  an  earlier  level.  Thus,  if  we  change  the 
statement  to  say,  "John  Smith  is  a  prime  number,"  then  in  the  ordinary 
sense  of  the  use  of  proper  names  and  of  prime  number  the  statement  simply 
does not make sense. It is not a matter of empirical test that people
are  not  prime  numbers  nor  even  a  function  of  arbitrary  definition  such 
as  that  there  is  only  one  prime  minister.  People  just  are  not  the  sorts 
of  things  that  can  be  prime  numbers,  nor  are  numbers  the  sorts  of  things 
that  can  be  prime  ministers. 

The last example, while failing at the third level of language (that
is, failing to make sense), still is adequately formed at lower levels of
language. Thus the grammar and orthography of the example are impeccable.
It is not necessary to give examples of failures at the syntactic or
morphological level; they are both obvious and outside the scope of
this discussion. It is apparent, however, that the progression of
criteria applies to the lower levels of language. Thus it is pointless to
determine whether a combination of letters that do not form words in the
language is grammatical or whether a combination of words that is not a
sentence meets the sense criterion of level three.

*Given the knowledge that only one person may, by definition, be prime
minister and that Smith and Jones are not the same person, then the
sentence is logically incorrect.

The observation that it makes no sense to say of some sorts of things
(for example, people) that they are other sorts of things (for example,
prime numbers) is central to the theoretical treatment of the sense level.
To say that a thing is a particular sort of thing is to predicate
something of it. Some predicates may be applied to the same things and thus
may be called copredicable. The fundamental hypothesis of sense value
theory is that if two predicates, say A and B, are copredicable, then
either A is predicable of all the things of which B is predicable, or
else B is predicable of all the things of which A is predicable, or both.
The last statement implies that for two predicates, A and B, either all
the individuals or things of which A is predicable may also be described
by B, or else A may be predicated of all the things of which B is
predicable, or else there are no individuals of which both A and B may be
predicated.
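
Stated in terms of sets, the hypothesis says that the scopes of any two predicates are either nested or disjoint. The sketch below checks that condition for a single pair; representing a scope as a Python set of individuals, and the three sample scopes, are assumptions made here for illustration and are not taken from the formal development in Sommers or Darmstadt.

```python
def copredicable(scope_a, scope_b):
    """Two predicates are copredicable when some thing falls under both."""
    return bool(scope_a & scope_b)

def satisfies_hypothesis(scope_a, scope_b):
    """Copredicable predicates must have nested scopes; others pass trivially."""
    if not copredicable(scope_a, scope_b):
        return True
    return scope_a <= scope_b or scope_b <= scope_a

# Hypothetical scopes, invented for illustration only.
INTERESTING = {"sky", "book", "giraffe", "speech", "walk"}
RED = {"sky", "book", "giraffe"}
BRIEF = {"speech", "walk"}

if __name__ == "__main__":
    print(satisfies_hypothesis(RED, INTERESTING))   # nested scopes
    print(satisfies_hypothesis(RED, BRIEF))         # disjoint scopes
    print(satisfies_hypothesis({"a", "b"}, {"b", "c"}))  # partial overlap
```

Only the last pair, whose scopes overlap without either containing the other, fails the test; that configuration is exactly what the hypothesis rules out.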

The predicability relations between predicates are perhaps best
illustrated graphically with a specific example. Figure 13 shows some
of the terms in a language disposed in a hierarchical tree. The
individuals or things are underlined and are located in the lowest
nodes. The predicates are in the higher nodes. If a predicate is
connected to an individual by descending lines without any ascending
lines intervening, then it is predicable of that individual. It follows
that any pair of predicates connected by a series of lines without
reversals from ascending to descending, or that are at the same node, are
copredicable. For those at the same node, the same set of individuals
may be described. If one predicate is higher, its scope is greater and
it applies to more individuals than the lower predicate, but its scope
includes the scope of the lower predicate. If two predicates cannot be
connected by a series of lines without reversing direction, then their
scopes have no individuals in common. The latter condition requires that
no more than one descending line enter a node; generally, more than one
will leave it if it is not a bottom node.

FIGURE 13. Terms of a Language Disposed in Hierarchical Tree
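
The tree conventions above can be made concrete with a parent table. In the sketch that follows, each node records the single node above it, so the rule that no more than one descending line enters a node holds by construction. The node names and the placement of terms are a hypothetical stand-in built from terms mentioned in this appendix; they are not a transcription of Figure 13.

```python
# Child node -> the one node directly above it (hypothetical layout).
PARENT = {
    "physical": "top",
    "temporal": "top",
    "sky": "physical", "book": "physical", "giraffe": "physical",
    "speech": "temporal", "walk": "temporal",
}

# Predicate -> the node at which it sits (several may share a node).
PREDICATE_NODE = {
    "interesting": "top",
    "red": "physical", "not-red": "physical", "heavy": "physical",
    "brief": "temporal",
}

def ancestors(node):
    """The node itself plus every node reached by ascending lines only."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def predicable_of(predicate, individual):
    """True when descending lines alone lead from predicate to individual."""
    return PREDICATE_NODE[predicate] in ancestors(individual)

def copredicable(p, q):
    """True when the two predicates share a node or lie on one line of descent."""
    a, b = PREDICATE_NODE[p], PREDICATE_NODE[q]
    return a in ancestors(b) or b in ancestors(a)

if __name__ == "__main__":
    print(predicable_of("red", "sky"), predicable_of("red", "speech"))
    print(copredicable("red", "heavy"), copredicable("brief", "red"))
```

Because `PARENT` is a mapping, a second descending line into any node cannot even be recorded, which is why detecting a violation requires working from the raw predications rather than from a finished tree.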

Some specific examples from the tree may clarify these generalizations.
The top node of the tree in this case is filled by "interesting." One of
the theorems in the formal development of sense value theory demonstrates
that there must always be a single upper node for any given language.
This theorem means that there are always some predicates that are
predicable of all individuals. The right hand node below "interesting"
contains all color terms. It is worth noting that both "red" and "not-red"
have the same scope in sense value terms, even though they will be
mutually exclusive at the factual level. This correspondence occurs
because it makes sense to describe a sky that happens not to be blue as
blue or a book that happens not to be green as green. Notice, however,
that the tree has already bifurcated and that there are some individuals
that cannot be described by color predicates (for example, "a speech" and
"a walk"). Thus "brief" and "red" are not copredicable while "red" and
"heavy" are.


For  the  purpose  of  this  discussion,  however,  we  are  not  primarily 
interested  in  mapping  predicability  relations  but  in  the  contribution 
of sense value theory to automatic inferential processing via the
detection of equivocation. But it is precisely the mapping of predicability
relations that allows us to detect equivocation automatically. Thus,
there  is  a  sense  in  which  "a  speech"  might  be  referred  to  as  colored  or 
even  as  "red."  Yet  there  does  not  seem  to  be  any  obvious  sense  in  which 
"a  giraffe"  would  be  described  as  "brief."  If  we  accepted  the  sensibility 
of  a  "red  speech,"  without  taking  into  account  the  new  sense  in  which 
"red"  was  being  used,  then  it  would  be  necessary  to  place  a  descending 
line  from  the  "red"  node  to  the  "speech"  node  in  the  graphic  representa¬ 
tion.  But  this  step  violates  the  fundamental  hypothesis  of  sense  value 
theory.  In  this  case  it  is  easy  to  see  that  the  hypothesis  is  correct 
and  that  it  is  only  apparently  violated  because  "red"  is  being  used  in 
two  senses.  The  resolution  of  the  apparent  difficulty  in  sense  value 
terms  is  to  say  that  there  are  at  least  two  senses  of  "red" — "red  1" 

(color)  and  "red  2"  (politics).  Each  of  these  predicates  could  then  be 
placed  in  its  appropriate  tree  location. 

Let us consider a specific set of assertions, their possible tree
representations, the automatic detection of an equivocation, and an
approach to the automatic resolution of the equivocation. Some of the
terms in the example on page  will be used to show how the problem
of equivocation may result in invalid inference. The individuals to be
considered are:

    Socrates      = S
    The number 2  = N
    A building    = B

The predicates are:

    Interesting   = I
    Rational      = R
    Tall          = T

The  possible  predications,  the  only  ones  we  are  likely  to  encounter  in 
sensible text, are:

    S - I    S - T    S - R
    N - I    N - R
    B - I    B - T


A graphic representation of the sense relationships, ignoring
equivocation, is:


But this representation violates the basic assumption of sense value
theory: two descending lines enter node S. Therefore, we automatically
have evidence of an equivocation. There are three terms that, if regarded
as equivocal, can resolve the difficulty. These terms (S, R, and T)
lead to three possible graphic solutions consistent with sense value
theory:


It  is  intuitively  obvious  that  the  first  two  representations  are 
incorrect because "Socrates" and "Tall" have not been used equivocally
in  these  assertions.  That  the  third  representation,  which  regards 
"rational"  as  equivocal,  is  indeed  correct  can,  however,  be  concluded 
on  non-intuitive  grounds.  There  exist  both  economic  and  aesthetic 
criteria  that  lead  to  a  correct  conclusion  about  which  term  is  equivo¬ 
cal,  and  these  criteria  can  be  automated.  Thus,  consider  the  problem 
of  adding  new  terms  to  each  of  the  structures.  If  we  wanted  to  add 
"Aristotle"  or  any  other  person  to  the  first  representation,  it  too 
would have to be regarded as equivocal since both "rational" and "tall"
may  be  predicated  of  "Aristotle."  If,  on  the  other  hand,  we  wanted  to 
add  a  predicate  such  as  "heavy"  or  "colored"  to  the  second  representa¬ 
tion,  then  both  of  these  terms  would  have  to  be  made  equivocal.  It  is  only 


the  third  representation  that  can  accommodate  both  additions  without 
increasing  the  number  of  theoretically  necessary  equivocations. 

It is possible to formulate appropriate algorithms for automatically
detecting and resolving equivocation in a corpus. The algorithm would
assume that all linguistic work at levels of language lower than the
level of sense would be supplied (that is, at the levels of syntax and
spelling). Thus a computer program for detecting and resolving
equivocation on the basis of sense value theory would assume an input of
individual predicate pairs distilled from the sentences of a corpus by a previous
syntactic processor. The program would then detect any violations of the
sense value hypothesis. This function could be done by producing a machine
structure analogous to the graphic representations and checking for multiple
descending entries into a node. Such a representation is perhaps most
conveniently developed in a list processing system. Once a tree
violation had been detected, the rule of economy could be used for the
resolution of equivocation. That is, the term that produces the smallest number
of entries or equivocations in the tree representation when regarded as
two terms is interpreted as equivocal.
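
A minimal sketch of such a program is given below, under several assumptions that go beyond the text. Violations are detected by testing whether two predicate scopes overlap without nesting, which stands in for checking multiple descending entries into a node. The economy rule is approximated by brute force: each term in turn is regarded as two terms, every way of dividing its predications between the two senses is tried, and the term leaving the fewest violations is chosen. The function names, the added individual A (Aristotle), and the added predicate H (heavy) are hypothetical.

```python
from itertools import combinations, product

def violations(pairs):
    """Predicate pairs whose scopes overlap without one containing the other."""
    scope = {}
    for individual, predicate in pairs:
        scope.setdefault(predicate, set()).add(individual)
    return [(p, q) for p, q in combinations(sorted(scope), 2)
            if scope[p] & scope[q]
            and not (scope[p] <= scope[q] or scope[q] <= scope[p])]

def best_split(pairs, term):
    """Fewest violations left after regarding `term` as two terms."""
    touching = [pair for pair in pairs if term in pair]
    rest = [pair for pair in pairs if term not in pair]
    best = len(violations(pairs))
    # Each predication of the term goes to sense 1, sense 2, or both.
    for choice in product(("1", "2", "12"), repeat=len(touching)):
        split = list(rest)
        for (individual, predicate), senses in zip(touching, choice):
            for s in senses:
                if individual == term:
                    split.append((term + s, predicate))
                else:
                    split.append((individual, term + s))
        best = min(best, len(violations(split)))
    return best

def resolve(pairs):
    """Terms that are cheapest to regard as equivocal, with all costs."""
    terms = sorted({t for pair in pairs for t in pair})
    cost = {t: best_split(pairs, t) for t in terms}
    lowest = min(cost.values())
    return [t for t in terms if cost[t] == lowest], cost

# The seven predications given above, as (individual, predicate) pairs.
BASE = [("S", "I"), ("S", "T"), ("S", "R"),
        ("N", "I"), ("N", "R"),
        ("B", "I"), ("B", "T")]

# Hypothetical additions: Aristotle (A) and the predicate "heavy" (H).
EXTENDED = BASE + [("A", "I"), ("A", "T"), ("A", "R"),
                   ("S", "H"), ("A", "H"), ("B", "H")]

if __name__ == "__main__":
    print(violations(BASE))
    print(resolve(BASE)[0])
    print(resolve(EXTENDED)[0])
```

On the seven original pairs the sketch reports the single violation between R and T and finds S, R, and T equally cheap to split, in line with the three graphic solutions. Once Aristotle and "heavy" are added, only R can be split without leaving further violations. The exhaustive search is exponential in the number of predications per term and is meant only to make the criterion explicit, not to suggest a practical method for a large corpus.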

In addition to developing a partial model of an automatic system
that can detect and correct equivocations by using sense value theory,
it would also be desirable to verify the theory and its applicability
empirically. The essential questions are whether the basic hypothesis
of the theory as outlined is correct for a substantial corpus of text or
sense value judgments and whether the economy criterion for resolving
equivocation produces accurate results. Since syntactic preprocessing
is assumed for this partial model, experimental inputs can as well be
developed from judgments about the sensibility of individual predicate
pairs rather than from an extensive search of an information corpus.

*Schemes for handling multi-termed predicates have also been developed.
One way to deal with an N-termed predicate is as N single-termed
predicates. Thus the analysis of relations involving any number of
individuals would be possible in principle with a system using
individual predicate pairs as input.